NLM's Garbage Collection Race
This is the first in (hopefully!) a series of blog posts where I'll talk about some of the cooler bugs I've run into here at Delphix. Let's start with some background to explain some things that will be handy to know.
Programs occasionally need to synchronize access to files. When all the accesses take place in one process, programs can use concurrency primitives like binary semaphores to coordinate. When it's multiple processes, possibly on multiple machines, a better mechanism is needed.
This lead to the creation of "file locks", which are basically filesystem-level locking of regions of files. A couple syscalls for this were standardized into POSIX 2001. The reason this matters to us at Delphix is that Oracle (and possibly other database software as well) uses this system for the Oracle control file.
At Delphix, we make extensive use of NFS. It's one of the driving forces behind our product. And yet, NFSv3 came out in 1995, before file locking had been standardized. As a result, NFSv3 has no native mechanism to deal with file locks.1
Any operating systems that want to provide file locking capability for NFSv3 have to implement the standard protocol called NLM. In Solaris, the kernel module that does this is called klmmod. Finally, any good debugging story is going to involve debugging tools. In solaris, the kernel debugger is one called The Modular Debugger (mdb). It doesn't do source-level debugging, but it's very good at certain things.
One of those things is inspecting the state of a giant complex system when it has stopped moving, like in a kernel panic. You don't need to know very much about mdb for the purposes of this blog post 2 but the basics are as follows: the prompt for a command is a >, and anything in this color is a command being typed in to mdb. "::regs" means "dump the registers of the current thread", "::stack" is "give me a stack trace", and anything of the form "1badd00d::foo" means "apply command foo to the stuff at address 0x1badd00d."
While preparing to start an upgrade, a member of the support team discovered that the Delphix Engine was rebooting when no one told it to. After the reboot, he found a crash dump. That crash dump shows that the panic happened in nlm_gc. NLM has its own system of host management, independent of NFS's.
It has a list of idle hosts, which the host gets put onto after people are done manipulating it. After a fixed time on the idle list, the host is freed (if it has no file locks held; if it does, we know they'll probably be back to clean them up eventually). This is presumably to reduce needless allocations when a host is in heavy use. This function encountered a bogus mutex, tripped an assert and panicked:
Our support guy grabbed a couple more pieces of information before handing off the crash dump to a colleague of mine George Wilson, who does a lot of work in the kernel (enough that he's the default assignee for OS bugs, which is probably why this went to him at first), though more specifically is heavily involved in ZFS:
This grabs the stack trace with arguments, and analyzes the mutex passed into mutex_vector_enter (a.k.a. mutex_lock) to confirm that it is indeed bogus. Why did we pass in a bogus mutex, and where did it come from? George took a closer look. First, we have to take a look at the list of idle hosts, to see what nlm_gc is working on. nlm_gc takes a struct nlm_globals *, which is how Solaris encapsulates global data for a specific module.
Now lets take a look at the lock for the first entry on the list:
That's the one! Now, what does that "opaque" field mean? Normally, mutexes contain address of the kthread_t for the thread who owns the mutex. For those who know how memory allocation works, you know that objects are usually allocated so that their last digit is 0 or 8,3 and big objects like thread control blocks basically always start at 0.
So 6 is a strange starting address for an allocation. What gives? Did someone corrupt the host? This is the point where George passed off the bug to Delphix's resident expert on NLM, Jeremy Jones.
At the time, Jeremy was my manager, and he knew I'd been itching to get my hands on some interesting kernel work. He offered to take a look at it with me. There was no way I was going to say no to a chance to debug a kernel panic in a real kernel! A sensible first step seems to be to try to understand why that mutex is the way it is.
Could it have happened without random memory corruption, which would be incredibly annoying to debug? When we dig in to mutex.c to try to understand the system, we see something very interesting:
Ah ha! Mutexes get OR'd with 6 when they're destroyed. The rest of the pointer, then, should be a kthread_t.
And the nlm_globals struct has a field that stores the nlm_gc thread:
That sure looks like the same thread. So the host at the front of the idle list was already destroyed by the gc thread (further evidence includes that the nh_flags field in the host struct doesn't include the INIDLE flag, which it should)! How could that happen? Here is the rest of the idle list.
That's clearly an infinite loop. However, the link is walkable in the reverse direction:4
Here we see what's happened; somehow, from the tail, the link looks reasonable until we get to the head, and the head's pointers are both pointing to some other element. When we examine that element, we see something very odd:
Its link pointers are also both pointing to itself. This is clearly a very strange state; unless there was some random pointer accesses (unlikely), someone must have done something improper with the linked list. After some extensive whiteboarding, we discovered you can achieve this effect either by removing an already removed element or by inserting something to the tail of the list that is already at the tail of the list.5
However, only the former can produce this effect and leave sane-looking elements at the tail of the list. We concluded the former must have happened at some point; someone removed something from the linked list that was already removed.
This could be caused by a bug in the refcounts (possible, but unlikely), but after reading the code, Jeremy found another good candidate: in the gc, after calling nlm_host_has_locks, we check if it returned non-zero (i.e. the host has some locks held) or if the host has some references (i.e. someone else grabbed a reference to it while we were working).
In either case, we remove it from the idle list and update the timeout. However, if someone has grabbed a reference to it, they have (in all relevant cases) already removed it from the idle list! This looks like our double remove.
The window is pretty small; while we're checking for locks someone has to grab a reference, and then after they drop the global lock, but before they drop the reference, we have to finish seeing if they have locks, grab the global lock, and move on.
Proving this can indeed cause a panic involved a few steps: First, enable QUEUEDBEUG for nlm_impl.c (and related files). This causes the fields of a removed node to get set to 1 after removal; if we hit the bug again, we should kernel panic immediately, with a clear page fault.
Second, increase the gc frequency to one second (and decrease the idle timeout at the same time). This is easy to do by launching lockd with -t 1. Third, add some delays to the kernel to increase the odds of hitting the bug. We added a ten second delay after unlocking the host lock after has_locks, and before grabbing the global lock.
We also added a 30 second delay in host_findcreate in the case where an existing host is found. This is the one place that the lock and unlock code paths use to find hosts. Fourth, create a test program that locks a file, unlocks it and repeats.
Between the lock and unlock, we wait 2 seconds, and between unlock and relock we wait 3. The idea, then is that the following pattern will emerge:
When we run this test, indeed we get a kernel panic. On examination of the crash log, we see the following:
We've panicked in the gc. Let's look at the code at that address:
Reading assembly is not everyone's strong suit, so I'll annotate it:
That should make things easier to understand, but here it is in C code (instructions 0x14e though 0x163):
Basically, we're just doing some sanity checks before removing from the list. Let's look at our registers:
%rdx contains the value of one of the fields of the current host's linked list structure, and it's 1! That means we are removing a host that was already removed.
The fix is simple; we should not remove a host from the idle list if nh_refs is non-zero, because someone else has done that already. We also added an assert to make sure that hosts that are being unregistered are on the idle list. Here's the diff (with documentation changes dropped for brevity):
This patch hasn't made its way into illumos just yet, but it should come up for review soon.
That was my first real kernel bug at Delphix; I've now taken over responsibility for maintaining NLM. Let me know what you thought of this blog post, and how I could improve future entries in the series. You can either leave a comment here, or email me at firstname.lastname@example.org. Happy bug hunting!
- Interestingly, NFSv4 does have this capability. If we used that, this blog post probably never would have happened. We don't use it for the same reason as most people: it's less well tested than NFSv3, because it's not in common use. This creates something of a vicious cycle.
- If you are interested in knowing more, there's a chapter from a textbook here. But basically, the coolness is that in addition to the standard "I know the addresses of all the functions so I can set breakpoints", mdb also knows about the layout of all the structs, the addresses of all the global variables, and you can write custom logic to parse and present that information in useful and simple ways. For example, I wrote a couple of mdb commands that walk through the list of nlm hosts and print out the address, hostname, vnode, sysid, PID, state, and lock type all the file locks being held. It also lets you build these on the fly by piping together the output of various commands.-
- Specifically, things are usually allocated so that they pack together nicely. 4-byte things are aligned to 4 bytes, 8 byte things to 8, and so on. However, most allocators also have a minimum size, so you don't really get 2-byte things allocated to any of 0,2,4,6,8,a,c and e. Instead they're allocated 4 bytes of space, and are treated as such by the allocator. Things of non-power-of-two sizes (like a 20 byte struct) tend to get "padded" up to the next power of two (or some other convenient size, it depends on the compiler).
- If you're paying close attention after this, you'll notice that I'm changing some of the values in ways that aren't explained, some of the pointer values don't point to what you expect, and the code that appears later will look pretty weird; not like a normal linked list at all. This is because the Solaris kernel's variadic (i.e. implemented in macros) linked lists are a bit odd; a struct (we'll call it the node struct) that wants to be part of a linked list has a field containing a link struct; the link struct has a tqe_prev and tqe_next pointer. The tqe_next pointer points to the beginning of the next node struct (or NULL if it's the tail), but the tqe_prev pointer points to the tqe_next field of the previous node struct's link struct. This simplifies some of the operations; you don't have to special case inserting or removing at the head, because the tail of the list acts like a prev pointer, and the head acts like a next pointer. In the nlm_host struct, the tqe_next appears at addr+0x10, and the tqe_prev field appears at addr+0x18.-
- You might wonder why the linked list didn't check for this case and panic earlier, because it should have been able to detect it; the answer is that the error checking code is unnecessary if the code has no bugs, and can cause a significant performance ding if you're doing a lot of linked list work. As a result, it's turned off by default, but a debug kernel build turns it on.