Using modern tooling to find a tough Linux kernel bug

I solved a pretty interesting bug in the Linux kernel livepatch subsystem earlier this year. The issue was this: when rolling out a livepatch to the fleet, we would detect a spike in TCP retransmit events while the livepatch was being applied. Let’s start by discussing the basics of livepatching and TCP retransmits in the kernel, and then I’ll take you through how I root caused the bug.

Background on KLPs and TCP Retransmits

KLP

Kernel livepatches (KLPs) are a mechanism for updating kernel code in real time on a running system. Livepatching allows you to ship updates to a kernel without having to reboot the system, and can be useful when you have to quickly roll out a patch to your fleet.

The Linux kernel’s Livepatch documentation page has much more detailed information, but at a high level, this is how a KLP is applied:

  1. The patch application process begins by “inserting” the patch into the kernel as a module via insmod.
  2. The KLP may reference symbols (both exported and non-exported) in other modules, or in vmlinux (the main kernel binary). The module has no way of knowing the addresses of these symbols before it’s loaded, so relocations are performed on the module as the patch is applied, pointing it at the correct addresses in the live kernel. This requires looking up each referenced symbol in the symbol table of the module (or of vmlinux) that defines it.
  3. Each task (i.e. thread) in the system is “transitioned” to using the new patch. Transitioning simply means verifying that the task doesn’t have any patched functions in its callstack. If it does, the livepatching subsystem schedules some work to re-check the task at a later time.
  4. Once each task is fully transitioned, the patch is “enabled”, and the insmod call is completed.

This is a simplification, but hopefully gives you at least a high-level idea of what livepatching is. For now, just know that it’s a way of patching kernel code while the kernel is running, and it requires a lot of dark magic to work (some of which we’ll look at more closely below – dark magic is only interesting if it’s brought into the light).
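
To make this a bit more concrete, here’s roughly what a minimal livepatch module looks like. This is a sketch modeled on the kernel’s samples/livepatch/livepatch-sample.c; the patched function (do_something()) and its replacement are hypothetical, and a real-world KLP built with a tool like kpatch-build is considerably more involved:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>

/* Hypothetical replacement for a (made-up) vmlinux function. */
static int livepatch_do_something(void)
{
    pr_info("patched do_something() called\n");
    return 0;
}

static struct klp_func funcs[] = {
    {
        .old_name = "do_something",    /* symbol to patch */
        .new_func = livepatch_do_something,
    }, { }
};

static struct klp_object objs[] = {
    {
        /* A NULL name means the symbol lives in vmlinux. */
        .funcs = funcs,
    }, { }
};

static struct klp_patch patch = {
    .mod = THIS_MODULE,
    .objs = objs,
};

static int livepatch_init(void)
{
    /* Kicks off the relocation + per-task transition described above. */
    return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");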

TCP Retransmits

I’ll assume that you’re familiar with TCP. And if you’re familiar with TCP, you’re probably aware that it provides reliable, in-order communication between two network sockets. Of course, the Internet is a wild and chaotic place, and many things can go wrong in the network sitting between those two sockets. A router could go down, the network could get congested because Kanye tweeted something, etc. To account for such scenarios, the TCP protocol stipulates that a socket should retransmit a packet if the peer socket has not acknowledged receipt of that packet in a timely manner.

In the Linux kernel, this retransmission happens in a function called tcp_retransmit_skb(). Given its importance to the networking subsystem, the kernel defines a trace event, tcp:tcp_retransmit_skb, that fires whenever a retransmission happens. Like all kernel trace events, this allows users to hook into and analyze the event using tools such as BPF (or, more conveniently, front-ends like bpftrace or funccount), kernel probes (kprobes), and perf. All of these tools have a very rich feature ecosystem. Familiarity with them isn’t a pre-requisite for the rest of the article, but their documentation is useful as a reference or for you to peruse at your leisure some other time.
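
To give a rough sense of what “being a trace event” means, the retransmit path in net/ipv4/tcp_output.c contains a hook that fires every time a segment is retransmitted. Heavily simplified (this is a sketch of the idea, not the exact kernel source), it looks something like this:

/* Simplified sketch of the retransmit path in net/ipv4/tcp_output.c. */
int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
{
    int err = __tcp_retransmit_skb(sk, skb, segs);

    if (err == 0) {
        /* This is the tcp:tcp_retransmit_skb trace event that tools
         * like funccount, bpftrace, and perf can attach to. */
        trace_tcp_retransmit_skb(sk, skb);
    }
    return err;
}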

Detecting the retransmits

Now that we have some context on the systems that are relevant to this debugging investigation, let’s quickly touch on how the retransmits were observed in the first place. As I mentioned above, the bug was manifesting itself as a spike in TCP retransmits. This was observed using funccount-bpfcc:

# funccount-bpfcc -T -i 1 't:tcp:tcp_retransmit_skb'
Tracing 1 functions for "b't:tcp:tcp_retransmit_skb'"... Hit Ctrl-C to end.
                                                                                                                                    
09:58:44
FUNC                                    COUNT
tcp:tcp_retransmit_skb                   12

09:58:45
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   42

09:58:46
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   31

09:58:47
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   3631

09:58:48
FUNC                                    COUNT
tcp:tcp_retransmit_skb                   5185

09:58:49
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   653

09:58:50
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   19

09:58:51
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   18

^C
Detaching…

Let’s talk about how we invoked funccount-bpfcc, and then we’ll discuss what it’s telling us. The -i 1 flag tells funccount that we want to count the number of events that occur every 1 second, and the -T flag tells it to print the current time with each sample. Finally, the 't:tcp:tcp_retransmit_skb' string tells it that we want to count the tcp_retransmit_skb event in the tcp event namespace. If you haven’t read up on kernel trace events, this just means that there’s a “TCP retransmit” event hook in the kernel that we can attach to in order to count invocations, print stacktraces, print arguments, and do many other useful things.

Looking at the counts, we see that at first there was a period of calm where we only had low double-digit numbers of TCP retransmit events, indicating a normal workload for the server in question. Then there’s a 100x+ bump in TCP retransmits between 09:58:47 and 09:58:49, followed by a quick return to normalcy. A 100x bump. Ouch.

Note that associating the TCP retransmits with applying a KLP was simply a matter of recording when the KLP was being applied, and investigating dmesg to confirm when the kernel was applying the patch:

[ 35927.457289] livepatch: enabling patch 'my_kernel_livepatch'
[ 35928.460032] livepatch: 'my_kernel_livepatch': starting patching transition
[ 35929.411708] livepatch: 'my_kernel_livepatch': patching complete

Let's figure out what's going on here.

Debugging the cause of the TCP retransmits

Step 1: Profiling the system to see what’s going on

My first thought when debugging this was to see where we were spending most of our time during the patch application. perf makes this dead simple:

# perf record -a -g --all-kernel -- kpatch load --all
loading patch module: /var/lib/kpatch/.../my_patch.ko
waiting (up to 15 seconds) for patch transition to complete...
transition complete (2 seconds)
[ perf record: Woken up 45 times to write data ]
[ perf record: Captured and wrote 16.260 MB perf.data (81590 samples) ]
# ls
perf.data

Like we did for funccount-bpfcc, let’s discuss how we’re invoking the command, and then we’ll take a look at what it’s telling us. perf record is a perf subcommand that collects a “profile” on the system. There are many things that “profile” could mean, such as collecting some performance counters. In our case we passed -g, which directs perf to record callstacks. -a tells perf to profile all of the CPUs on the system rather than a particular subset of CPUs. We don’t yet know what’s happening on the system that’s causing this, so let’s cast as wide of a net as we can to get a high level picture of what’s going on. --all-kernel means that we want to see what’s going on in kernel space. The implication here is of course that perf can also be used to profile userspace programs. Because we’re observing TCP retransmits during a kernel livepatch transition, chances are much higher that this is a bug in kernel space, so let’s start there. Finally, -- kpatch load --all is the command that perf runs, which applies all active livepatches on the system.

Let’s take a look at the profile that perf collected:

# perf report
Processing events...
-   40.84%  swapper          [kernel.vmlinux] [k] intel_idle
   - 0xffffffff...
      - 39.78% start_secondary
           cpu_startup_entry
           do_idle
           cpuidle_enter
           cpuidle_enter_state
           intel_idle
      - 1.06% start_kernel
           cpu_startup_entry
           do_idle
           cpuidle_enter
           cpuidle_enter_state
           intel_idle
-   10.16%  insmod           [kernel.vmlinux] [k] kallsyms_expand_symbol.constprop.1
     0x20646...
     syscall
     entry_SYSCALL_64
     do_syscall_64
     __do_sys_finit_module
     load_module
     do_init_module
     do_one_initcall
     0xffffffff...
     klp_enable_patch
   - klp_init_object_loaded
      - 9.75% klp_resolve_symbols.isra.11
           klp_find_object_symbol
         - kallsyms_on_each_symbol
              kallsyms_expand_symbol.constprop.1
+    1.31%  systemctl        [kernel.vmlinux] [k] __lock_acquire.isra.36
+    1.27%  swapper          [kernel.vmlinux] [k] __lock_acquire.isra.36
+    0.67%  swapper          [kernel.vmlinux] [k] lock_release
+    0.65%  systemctl        [kernel.vmlinux] [k] lock_release
+    0.62%  chef-client      [kernel.vmlinux] [k] release_pages
+    0.60%  python3.8        [kernel.vmlinux] [k] __lock_acquire.isra.36
+    0.56%  chef-client      [kernel.vmlinux] [k] page_remove_rmap
+    0.54%  swapper          [kernel.vmlinux] [k] check_preemption_disabled
+    0.54%  systemctl        [kernel.vmlinux] [k] native_sched_clock
+    0.51%  systemctl        [kernel.vmlinux] [k] check_preemption_disabled
...

The profile shows a descending list of the hottest functions on the system. The top entry shows that the machine is mostly idle, with 40.84% of samples landing in intel_idle. That doesn’t tell us much, but the next entry shows that 10.16% of the time was spent in insmod (the process applying the patch).

perf says that insmod is spending most of its time in kallsyms_expand_symbol(), which is called in a loop from kallsyms_on_each_symbol(). Let’s take a closer look at that loop:

/*
 * Iterate over all symbols in vmlinux.  For symbols from modules use
 * module_kallsyms_on_each_symbol instead.
 */
int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
                      unsigned long),
                void *data)
{
    char namebuf[KSYM_NAME_LEN];
    unsigned long i;
    unsigned int off;
    int ret;

    for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
        off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
        ret = fn(data, namebuf, NULL, kallsyms_sym_address(i));
        if (ret != 0)
            return ret;
    }
    return 0;
}

The function is looping over every symbol in vmlinux and invoking a caller-supplied callback on each one. That certainly seems like something that could take a while and consume a lot of CPU, but interrupts are enabled throughout the function, so the NIC should have no trouble getting its interrupts delivered. It doesn’t look like the caller has disabled interrupts either, so this must not be the issue. All the same, this is our only data point thus far, so let’s bookmark this code in case we need to come back to it later.

Step 2: Figure out where interrupts are being disabled

kallsyms_on_each_symbol() was taking up a lot of CPU time, but interrupts were enabled the whole time, so it doesn’t seem to be the reason that TCP packets were being dropped. Perhaps there is another code path in the livepatch subsystem where interrupts are disabled for a long period of time. Looking through the livepatch source code, it looks like interrupts are disabled when we try to transition a task (i.e. a thread) to using the functions in the patch. This happens in klp_try_switch_task(), which itself uses task_call_func() to invoke klp_check_and_switch_task().

Looking at the implementation of task_call_func(), interrupts are disabled before the user-supplied function klp_check_and_switch_task() is invoked, so if klp_check_and_switch_task() takes a long time, that could be the cause of the issue. klp_check_and_switch_task() iterates over each function in every object being patched by the KLP, and walks the task’s stack to see whether any of those functions is present in any stack frame. If one is, the task is not ready to be transitioned; if the stack trace is clear of patched functions, it can be transitioned. That seems like it could qualify as a non-trivial amount of work.
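
For a sense of what that work actually looks like, here’s a heavily simplified sketch of the check, paraphrasing kernel/livepatch/transition.c (locking and error handling omitted):

/* Heavily simplified sketch; see kernel/livepatch/transition.c for the
 * real implementation. */
static int klp_check_stack(struct task_struct *task)
{
    static unsigned long entries[MAX_STACK_ENTRIES];
    struct klp_object *obj;
    struct klp_func *func;
    int nr_entries;

    /* Reliably capture the task's kernel callstack... */
    nr_entries = stack_trace_save_tsk_reliable(task, entries,
                                               ARRAY_SIZE(entries));
    if (nr_entries < 0)
        return -EINVAL;

    /* ...then check every function being patched against every frame. */
    klp_for_each_object(klp_transition_patch, obj)
        klp_for_each_func(obj, func)
            if (klp_check_stack_func(func, entries, nr_entries))
                return -EADDRINUSE;    /* still in use, retry later */

    return 0;
}

All of this runs while task_call_func() has interrupts disabled, which is exactly why we want to measure it.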

What we need to do now is measure how much time we’re spending in this region with interrupts disabled. Let’s use bpftrace:

bpftrace -e \
'kprobe:klp_enable_patch {
    @klp_cpu = cpu;
    @klp_started = true;
}
kprobe:task_call_func+0x39 {
    if (@klp_started && @klp_cpu == cpu) {
        @stash[cpu] = nsecs;
    }
}
kprobe:task_call_func+0x93 {
    if (@klp_started && @klp_cpu == cpu) {
        $delta = nsecs - @stash[cpu];
        @ = hist($delta);
    }
}
END {
    clear(@stash);
    clear(@klp_started);
}'

It looks like there’s a lot going on here, but we’ll break it down line-by-line. The bpftrace reference guide is a great resource to learn more, and can be referenced while you're reading through this article.

The -e flag precedes a string containing the bpftrace program that we want to run. There are other flags such as having it direct the output to a file, run a trace on a specific process, etc, but we want it to run for the whole system until we have more information so we’ll leave those flags off for now.

The probes themselves are pretty simple. kprobe:klp_enable_patch runs our probe when klp_enable_patch() is invoked in the kernel. klp_enable_patch() is the entrypoint for applying a KLP, so it’s the logical place for us to bootstrap the rest of our bpftrace script. @klp_cpu = cpu; records the CPU on which the patch is being applied in a global variable that can be accessed in other probes. This will be useful to ensure that we only collect data on the CPU that insmod is executing on, rather than also collecting data points from other CPUs, which would muddy the data. @klp_started = true; similarly sets a global flag indicating that patch application has begun, and will be useful to ensure that we don’t collect any data points before we enter the livepatch subsystem. Note that all global variables are falsey (zero) by default, so checking only that @klp_cpu == cpu would be insufficient: we would accidentally collect data for events on CPU 0 before klp_enable_patch() had been invoked.

Moving on, kprobe:task_call_func+0x39 means “run a kprobe starting at an offset of 0x39 bytes into task_call_func()”. 0x39 is the offset of the first instruction in task_call_func() at which interrupts are disabled. @stash[cpu] = nsecs; stores the current timestamp, in nanoseconds, in a map called @stash, keyed by CPU. Let’s take a look at how we’ll use these variables in the next probe.

kprobe:task_call_func+0x93, in complementary fashion to the first task_call_func probe, fires 0x93 bytes into task_call_func(), at the last instruction where interrupts are disabled. $delta = nsecs - @stash[cpu]; stores the difference in nanoseconds between the timestamp recorded by the first probe and the current time in a local variable named $delta. Given that the two probes sit at the first and last instructions where interrupts are disabled, $delta is the total amount of time, in nanoseconds, that interrupts were disabled for that invocation. @ = hist($delta); builds a global histogram of those durations. The variable need not be named; making it global is enough for bpftrace to print it out when the script exits.

The last probe, END { clear(@stash); clear(@klp_started); }, simply clears out two global variables that we don’t want to be printed when bpftrace exits. This isn’t strictly necessary, but avoids cluttering the output with text we don’t care about.

Let’s take a look at the output from running the script:

# bpftrace -e ' … '
Attaching 4 probes...
@:
[1K, 2K)              66 |                                      |
[2K, 4K)            4530 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)              75 |                                      |
[8K, 16K)             11 |                                      |
[16K, 32K)             7 |                                      |
[32K, 64K)             4 |                                      |
[64K, 128K)            1 |                                      |

@klp_cpu: 24

The histogram contains buckets covering a range of values, along with the number of samples that fell into each bucket. We also see that insmod ran on CPU 24. The buckets are in nanoseconds (that’s what nsecs records), so the vast majority of samples show interrupts disabled for no more than about 4 microseconds. There were some outliers on the order of 64-128 microseconds, but nothing anywhere close to long enough to cause thousands of TCP retransmits due to interrupts not being delivered.

In case you’re curious, the initial retransmission timeout is stored in the TCP_TIMEOUT_INIT macro, which is defined as 1 second, as specified in RFC 6298 section 2.1:

#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))

So even for the outliers of roughly 128 microseconds, we’re several orders of magnitude beneath the threshold that would cause us to send a TCP retransmission.
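
For reference, the neighboring bounds on the retransmission timeout in include/net/tcp.h (quoting from memory for kernels of this era, so double-check your version) make the gap even more obvious; even the 200 millisecond floor is more than a thousand times larger than our worst-case ~128 microsecond sample:

#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))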

Something else is amiss…what to do next?

Step 3: Look at how the kernel is polling for packets

It seems that we can’t identify a region where interrupts are disabled long enough to stop the NIC from delivering packets to the kernel. Let’s think about the next step in the pipeline. The NIC receives a packet and raises an interrupt to notify the driver, but is the kernel actually pulling that packet off of the NIC? To answer this we’ll have to inspect the behavior of the NAPI packet processing framework.

NAPI (the “new” API, which is not so new at this point, having been introduced in Linux 2.6 back in 2003) is a framework that allows for more performant network packet processing. Rather than the NIC raising an interrupt for every incoming packet, the driver takes a single interrupt and, rather than processing packets in the interrupt handler, simply tells the networking subsystem that packets are ready for processing. The networking subsystem then polls the driver for packets at a convenient later time, for some highly subjective definition of “convenient”. In contrast with the classical approach wherein a separate interrupt is delivered for each packet, the NAPI approach prevents interrupt storms from causing interrupt handlers to hog the CPU, and allows the networking subsystem to do clever things such as batching packet delivery and scheduling packet processing for when the CPU isn’t busy.
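
In code, the contract between a driver and NAPI looks roughly like the sketch below. This is a generic illustration rather than any real driver; everything prefixed with my_ is hypothetical:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

/* Hypothetical driver state; real drivers carry much more than this. */
struct my_nic {
    struct napi_struct napi;
    /* ... RX/TX rings, register mappings, etc. ... */
};

/* Hypothetical hardware helpers, stand-ins for real driver code. */
static void my_nic_disable_rx_irq(struct my_nic *nic);
static void my_nic_enable_rx_irq(struct my_nic *nic);
static bool my_nic_rx_one_packet(struct my_nic *nic);

static irqreturn_t my_nic_irq_handler(int irq, void *data)
{
    struct my_nic *nic = data;

    /* Don't process packets here: mask further RX interrupts and ask
     * the networking subsystem to poll us later. */
    my_nic_disable_rx_irq(nic);
    napi_schedule(&nic->napi);
    return IRQ_HANDLED;
}

static int my_nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic *nic = container_of(napi, struct my_nic, napi);
    int work = 0;

    /* Pull up to 'budget' packets out of the RX ring. */
    while (work < budget && my_nic_rx_one_packet(nic))
        work++;

    /* If we drained the ring, tell NAPI we're done and re-enable IRQs.
     * Otherwise we stay on the poll list and will be polled again. */
    if (work < budget && napi_complete_done(napi, work))
        my_nic_enable_rx_irq(nic);

    return work;    /* the 'work' reported by t:napi:napi_poll */
}

The key point for our investigation: that polling happens in softirq context, so if softirq processing is delayed, packets sit in the NIC’s ring instead of being delivered to the TCP stack.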

With that in mind, let’s see what NAPI is doing, to get a better understanding of why those packets seemingly aren’t being delivered. The t:napi:napi_poll kernel event fires when the networking subsystem polls a networking driver for packets. The event is defined as follows:

TRACE_EVENT(napi_poll,

    TP_PROTO(struct napi_struct *napi, int work, int budget),

    TP_ARGS(napi, work, budget),

    TP_STRUCT__entry(
        __field(    struct napi_struct *,    napi)
        __string(    dev_name, napi->dev ? napi->dev->name : NO_DEV)
        __field(    int,            work)
        __field(    int,            budget)
    ),

    TP_fast_assign(
        __entry->napi = napi;
        __assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
        __entry->work = work;
        __entry->budget = budget;
    ),

    TP_printk("napi poll on napi struct %p for device %s work %d budget %d",
          __entry->napi, __get_str(dev_name),
          __entry->work, __entry->budget)
);

The work argument in the event indicates how many packets were serviced by that poll, up to the per-poll budget, which defaults to 64 packets. A trace that aggregates the stack trace and the amount of work for each event seems like it would be useful:

bpftrace -e \
'kprobe:klp_enable_patch {
    @klp_cpu = cpu;
    @klp_started = true;
}
t:napi:napi_poll {
    if (@klp_started && @klp_cpu == cpu) {
        @napi_stacks[kstack, args->work] = count();
    }
}
END {
    clear(@klp_started);
}'

Before we try running it, let’s quickly go over the parts of the script that we haven’t seen before. kstack collects the kernel stack trace for the currently executing thread, and args->work extracts the number of packets being processed. So the script creates a tuple of (kstack, number of packets), and counts how many times each tuple occurs. This gives us the following output:

# bpftrace -e ' … '
Attaching 3 probes...
@klp_cpu: 19
...
@napi_stacks[
    net_rx_action+673
    net_rx_action+673
    __softirqentry_text_start+194
    run_ksoftirqd+26
    smpboot_thread_fn+184
    kthread+267
    ret_from_fork+31
, 64]: 45
...

There were a lot of stack traces that I clipped out, but the one I included is very interesting. It shows that on 45 separate occasions, when ksoftirqd was running, there were at least 64 outstanding packets. ksoftirqd is a per-CPU kernel thread that runs when softirqs are taking up too much CPU time and need to be processed asynchronously. It’s odd that there would be so many outstanding packets. Is something preventing ksoftirqd from running, and thus preventing some softirqs (packet processing, in this case) from being handled?
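
As a quick refresher on ksoftirqd’s role: softirq processing normally happens immediately after interrupt handling, but if there’s too much of it, the kernel hands the remainder to the per-CPU ksoftirqd thread, which then has to compete for CPU time like any other thread. Very roughly, paraphrasing kernel/softirq.c:

/* Very rough paraphrase of __do_softirq() in kernel/softirq.c. */
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;    /* ~2ms */
    int max_restart = MAX_SOFTIRQ_RESTART;             /* 10 passes */
    u32 pending;

restart:
    pending = local_softirq_pending();
    /* ... run the handler for each pending softirq (NET_RX, TIMER, ...) ... */

    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched() && --max_restart)
            goto restart;

        /* Too much work: defer the rest to ksoftirqd/N, which now has
         * to win CPU time from the scheduler like any other thread. */
        wakeup_softirqd();
    }
}

So if ksoftirqd can’t get onto the CPU, pending NET_RX softirqs (i.e. packet processing) just sit there.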

Let’s see if we can use perf to measure how long it’s taking to schedule ksoftirqd on the same core as insmod:

# export cpu=1
# perf record -e 'sched:*' -C $cpu taskset -c $cpu kpatch load --all
loading patch module: /var/lib/.../my_patch.ko
waiting (up to 15 seconds) for patch transition to complete...
transition complete (2 seconds)

# ls
perf.data

# perf script

[...]
          insmod 2530776 [001] 53287.660039: sched:sched_waking: comm=ksoftirqd/1 pid=16 prio=120 target_cpu=001
          insmod 2530776 [001] 53287.660041: sched:sched_stat_runtime: comm=insmod pid=2530776 runtime=202945 [ns] vruntime=219869947880235 [ns]
          insmod 2530776 [001] 53287.660041: sched:sched_wakeup: comm=ksoftirqd/1 pid=16 prio=120 target_cpu=001
[...]
 kworker/1:0-eve 2401391 [001] 53289.050664: sched:sched_switch: prev_comm=kworker/1:0 prev_pid=2401391 prev_prio=120 prev_state=I ==> next_comm=ksoftirqd/1 next_pid=16 next_prio=120

Aha! ksoftirqd was woken up at 53287.660039, but wasn’t actually scheduled onto the CPU until 53289.050664, almost 1.4 seconds later. So that’s 1.4 seconds during which ksoftirqd wasn’t able to service softirqs. That’s roughly the equivalent of interrupts being disabled for 1.4 seconds, so this has a very high likelihood of being the issue.

But why is ksoftirqd being starved? The answer is that this kernel is compiled as a non-preemptible kernel. At a high level, this means that a thread running in the kernel cannot be preempted and descheduled in order to run other threads; it keeps the CPU until it blocks, returns to userspace, or explicitly yields. Note that this is different from being interrupted. A non-preemptible kernel can still take hardware interrupts, but the interrupted kthread will not be descheduled in favor of another. If we look at the config options for the kernel, we indeed see that it was compiled with CONFIG_PREEMPT_NONE=y:

# cat /boot/config-`uname -r` | grep PREEMPT_NONE
CONFIG_PREEMPT_NONE=y

Step 4: Fixing the bug

Our (very strongly supported) theory is that ksoftirqd is being starved of CPU time because insmod, running on a non-preemptible kernel, is hogging the CPU. To fix this, we need some way for insmod to manually yield the CPU to other kthreads. cond_resched() is a lightweight function in the Linux kernel that yields the CPU from the executing task and allows the scheduler to run another task that is waiting in the CPU’s runqueue. This sounds like exactly what we need, so let’s give it a shot.
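
Under the hood, cond_resched() is cheap: it only drops into the scheduler if the current task has already been flagged as needing to be rescheduled. Roughly, simplified from kernel/sched/core.c:

/* Simplified sketch of cond_resched() on a non-preemptible kernel;
 * see kernel/sched/core.c for the real implementation. */
int __sched _cond_resched(void)
{
    if (should_resched(0)) {           /* has TIF_NEED_RESCHED been set? */
        preempt_schedule_common();     /* yield to the waiting task */
        return 1;
    }
    return 0;
}

In other words, sprinkling it into a hot loop costs almost nothing when nothing is waiting, but gives starved threads like ksoftirqd a chance to run when something is.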

But where should we add it? Remember when I told you to bookmark kallsyms_on_each_symbol() because it was our only data point? That was by far the hottest codepath in insmod, so let’s try adding it there!

 --
 kernel/kallsyms.c | 1 +
 kernel/module.c   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 0ba87982d017..2a9afe484aec 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -223,6 +223,7 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
         ret = fn(data, namebuf, NULL, kallsyms_sym_address(i));
         if (ret != 0)
             return ret;
+        cond_resched();
     }
     return 0;
 }
diff --git a/kernel/module.c b/kernel/module.c
index 40ec9a030eec..c96160f7f3f5 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -4462,6 +4462,8 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *,
                  mod, kallsyms_symbol_value(sym));
             if (ret != 0)
                 goto out;
+
+            cond_resched();
         }
     }
 out:
-- 
2.30.2

This patch adds a cond_resched() to the symbol lookup path both for modules, and for vmlinux (i.e. the main kernel). And voila! With this patch, we no longer observe the TCP retransmits:

# funccount-bpfcc -T -i 1 't:tcp:tcp_retransmit_skb'
Tracing 1 functions for "b't:tcp:tcp_retransmit_skb'"... Hit Ctrl-C to end.
                                                                                                                                    
11:01:52
FUNC                                    COUNT
tcp:tcp_retransmit_skb                   35

11:01:53
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   31

11:01:54
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   42

11:01:55
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   11

11:01:56
FUNC                                    COUNT
tcp:tcp_retransmit_skb                   15

11:01:57
FUNC                                    COUNT                     
tcp:tcp_retransmit_skb                   44

Detaching…

The patch was sent out to the Linux kernel here: https://lore.kernel.org/lkml/20211229215646.830451-1-void@manifault.com/, and was merged into Linux kernel version 5.17.

Takeaways

We covered a lot of ground in this post. We learned about KLP and TCP retransmits, kernel events, BPF, bpftrace, funccount-bpfcc, kprobes, perf, NAPI, cond_resched(), and more. There is a lot more to all of these tools than what we covered, and almost all of it is well documented online. Also, let’s take a minute to appreciate how great this tooling is. Consider that during our debugging journey, we didn’t have to compile anything until we were testing out our fix! That’s pretty amazing considering that we were debugging and inspecting a live-running kernel.

I hope you enjoyed following this debugging effort. My next series will likely be an introduction to virtual memory, discussing fundamental systems engineering concepts ranging from how virtual memory works, to how pointers work in C, and eventually possibly some slightly more advanced concepts such as caching. Tune in if you’re interested, and feel free to let me know if there’s something else you’d like me to cover.
