Debugging unmounting problems with DTrace

The Analysis

A few weeks ago I ran into a recurring problem during the unit tests for our product. The important part of the error message was always something like:

cannot unmount '/test-domain/group-0/container-22/temp': Device busy
could not destroy 'test-domain': could not unmount datasets

A little investigation showed that the error message was coming from a 'zpool destroy' command in a unit test framework script. The 'zpool destroy' command unmounts all file systems on the pool before it frees the associated data. In this case the unmounts were failing, presumably because some rogue process was accessing the file system. That was surprising, since the tests were running on a stripped-down VM with no other users and no processes that had any business touching the file systems created by the unit tests. A bit more investigation yielded some disappointing results:

  • The error was not reliably reproducible. It occurred often enough to be annoying because it would cause an otherwise clean unit test run to fail, but infrequently enough that you would need to repeatedly run the tests for hours before hitting it.
  • The error did not occur after any particular test or on any particular file system. Some tests seemed to fail more often, but almost all tests failed at least once.
  • After the test had failed, the same 'zpool destroy' command run manually would succeed, so it was too late to try to identify the process accessing the file system.

Given that last point, a logical next step was to find out where in the kernel the unmount occurred and force a crash dump when it failed. Then we could inspect the crash dump to see which process was accessing that file system. Browsing the source of our Illumos-based OS yielded the function zfs_umount(vfs_t *vfsp, int fflag, cred_t *cr) in zfs_vfsops.c as a likely place for the unmount to fail, and this DTrace one-liner caused the system to take a crash dump on failure:

dtrace -wn "fbt::zfs_umount:return /arg1 == EBUSY/ { panic(); }"

For those not familiar with DTrace here is the breakdown:

  • The probe fbt::zfs_umount:return fires when zfs_umount() returns.
  • The predicate /arg1 == EBUSY/ says that the listed actions should only be executed when the value being returned (arg1) is EBUSY.
  • The panic() action panics the kernel. The -w option is necessary to put DTrace into "destructive mode" to allow the panic() action to be run.
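
As an aside, if panicking the machine is too heavy-handed for your environment, the same pair of probes can simply log the failure instead. Here is a minimal, untested sketch along those lines; it assumes the vfs_t type and its vfs_count reference count are visible to DTrace through the kernel's CTF data:

/*
 * count-on-ebusy.d -- a minimal, untested sketch: report the
 * reference count of the vfs_t when zfs_umount() fails, rather
 * than panicking.
 */
fbt::zfs_umount:entry
{
        self->vfsp = (vfs_t *)arg0;
}

fbt::zfs_umount:return
/arg1 == EBUSY/
{
        printf("unmount failed with EBUSY, vfs_count = %d\n",
            self->vfsp->vfs_count);
}

fbt::zfs_umount:return
{
        self->vfsp = 0;
}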

The crash dump showed that the vfs_count field of the vfs_t passed in to zfs_umount() was greater than 1, which is what caused the unmount to fail. Unfortunately vfs_count is a simple integer reference count managed by calls to vfs_hold(vfs_t *vfsp) and vfs_rele(vfs_t *vfsp), with no back pointers to the holders (I was hoping to find something like ZFS's refcount_t). I shared these findings with Adam Leventhal, a co-worker, who pointed out that the holds must be released soon after the unmount fails because later attempts to destroy the pool succeed. That observation led to this DTrace script, which outputs stack traces of processes that release their holds on a vfs_t which failed to unmount:

/* umount.d */

/*
 * We will store a pointer to the vfs_t which failed to unmount in
 * the 'traceme' variable; it starts as NULL.
 */

/*
 * When entering zfs_umount() remember the vfs_t argument in case
 * the unmount fails.
 */
fbt::zfs_umount:entry
{
        self->zfsvfs = arg0;
}

/*
 * If zfs_umount() fails, store the failed vfs_t argument into
 * the 'traceme' variable.
 */
fbt::zfs_umount:return
/arg1 == EBUSY/
{
        traceme = self->zfsvfs;
}

/*
 * When vfs_rele() is called on the same vfs_t that previously
 * failed to unmount print lots of verbose debugging
 * information.
 */
fbt::vfs_rele:entry
/arg0 == traceme/
{
        printf("%s[%d,%d]: %s", execname, pid, ppid,
            curpsinfo->pr_psargs);
        stack();
        ustack();
}

And after a few unit test runs the culprit announced himself:

$ dtrace -s umount.d
CPU     ID       FUNCTION:NAME
  0  21673      vfs_rele:entry vmtoolsd[882,1]: /usr/lib/vmware-tools/sbin/amd64/vmtoolsd
              zfs`zfs_znode_free+0x80
              zfs`zfs_zinactive+0x9b
              zfs`zfs_inactive+0x11c
              genunix`fop_inactive+0xaf
              genunix`vn_rele+0x5f
              genunix`lookuppnvp+0x99a
              genunix`lookuppnatcred+0x11b
              genunix`lookupnameatcred+0x97
              genunix`lookupnameat+0x69
              genunix`cstatat_getvp+0x12b
              genunix`cstatat+0x5c
              genunix`fstatat+0x4c
              genunix`stat+0x25
              unix`sys_syscall+0x17a

              libc.so.1`_syscall6+0x1b
              libvmtools.so`Posix_Stat+0x9f
              libvmtools.so`0xfffffd7ffda5b5b3
              libvmtools.so`WiperPartition_Open+0x76
              libvmtools.so`GuestInfo_GetDiskInfo+0x43
              libguestInfo.so`0xfffffd7ffd202acc
              libglib-2.0.so.0`0xfffffd7ffea3ac4d
              libglib-2.0.so.0`0xfffffd7ffea380d3
              libglib-2.0.so.0`g_main_context_dispatch+0x6d
              libglib-2.0.so.0`0xfffffd7ffea397be
              libglib-2.0.so.0`g_main_loop_run+0x2b0
              vmtoolsd`0x403140
              vmtoolsd`0x402c1d
              vmtoolsd`0x40288c

  0  21673      vfs_rele:entry vmtoolsd[882,1]: /usr/lib/vmware-tools/sbin/amd64/vmtoolsd
              zfs`zfs_znode_free+0x80
              zfs`zfs_zinactive+0x9b
              zfs`zfs_inactive+0x11c
              genunix`fop_inactive+0xaf
              genunix`vn_rele+0x5f
              genunix`cstatat+0x8a
              genunix`fstatat+0x4c
              genunix`stat+0x25
              unix`sys_syscall+0x17a

              libc.so.1`_syscall6+0x1b
              libvmtools.so`Posix_Stat+0x9f
              libvmtools.so`0xfffffd7ffda5b5b3
              libvmtools.so`WiperPartition_Open+0x76
              libvmtools.so`GuestInfo_GetDiskInfo+0x43
              libguestInfo.so`0xfffffd7ffd202acc
              libglib-2.0.so.0`0xfffffd7ffea3ac4d
              libglib-2.0.so.0`0xfffffd7ffea380d3
              libglib-2.0.so.0`g_main_context_dispatch+0x6d
              libglib-2.0.so.0`0xfffffd7ffea397be
              libglib-2.0.so.0`g_main_loop_run+0x2b0
              vmtoolsd`0x403140
              vmtoolsd`0x402c1d
              vmtoolsd`0x40288c

It turns out that VMware Tools installs an init script that starts a vmtoolsd binary, which appears to be calling stat(2) in a way that places a hold on the file system we want to unmount. While the tools are distributed as a binary, some variant of the source is available from VMware's open-vm-tools project. Luckily the DTrace output had already pointed us in the right direction within the tool's code base: somewhere around the GuestInfo_GetDiskInfo() or WiperPartition_Open() functions. Reading the code revealed that vmtoolsd gathers system information from the guest operating system every few seconds. As part of this data gathering it gets a list of mounted file systems from /etc/mnttab and calls stat(2) on each of their mount points. This explained the behavior we were seeing in the unit tests: any test which unmounted file systems as part of its cleanup had a chance to race with vmtoolsd's stat(2) call.

I'm going to pause here and say that DTrace has just helped us do something pretty neat. With about 10 lines of code we were able to tell not just which process was causing our problem, but also which functions in the process's binary were responsible. The script is safe to run in production and should work for finding other instances of rogue processes interfering with unmount operations (although as written it is ZFS specific). Recently Matt Ahrens, another colleague here at Delphix, used it to find a similar race he was experiencing with the NFS locking daemon.
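
Making the script file-system agnostic should mostly be a matter of moving the first two probes from zfs_umount() up to the generic unmount path. Here is a rough, untested sketch of what that might look like; it assumes dounmount() takes the vfs_t as its first argument and returns the error code:

/* generic-umount.d -- a rough, untested variant of umount.d */

/*
 * Instrument the generic unmount path rather than the ZFS-specific
 * zfs_umount(), assuming dounmount(vfs_t *, int, cred_t *) returns
 * the error code.
 */
fbt::dounmount:entry
{
        self->vfsp = arg0;
}

fbt::dounmount:return
/arg1 == EBUSY/
{
        traceme = self->vfsp;
}

/* The reporting clause is identical to the one in umount.d. */
fbt::vfs_rele:entry
/arg0 == traceme/
{
        printf("%s[%d,%d]: %s", execname, pid, ppid,
            curpsinfo->pr_psargs);
        stack();
        ustack();
}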

The "Solution"

We couldn't uninstall or disable vmtoolsd without losing other features of VMware tools, but closer inspection of the VMware tools source revealed that:

  • The information from stat(2) is only used to decide if the file system is mounted on a "disk device" as part of determining if VMware supports some fancy features for that file system, but since ZFS is not supported the check fails later anyway.
  • Due to ifdefs the results of the stat(2) operation are actually ignored on Solaris-based operating systems.

Since we were pretty confident these stat(2) calls were unnecessary on our system, Adam suggested we try to trick vmtoolsd into thinking there were no mounted file systems, and so we created this abomination:

/* subvert-vmtoolsd.d */

#pragma D option destructive

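/*
 * When vmtoolsd opens /etc/mnttab, rewrite the path in its address
 * space to point at /dev/null so that it sees an empty list of
 * mounted file systems.
 */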
syscall::open:entry
/execname == "vmtoolsd" && copyinstr(arg0) == "/etc/mnttab"/
{
    this->f = "/dev/null";
    copyout(this->f, arg0, strlen(this->f) + 1);
}

Unlike the umount.d script this one is definitely not safe for production. This script intercepts all open(2) calls made by vmtoolsd. If the file being opened is "/etc/mnttab", DTrace will overwrite the file name in vmtoolsd's memory with "/dev/null" so it ends up opening what appears to be an empty file. Note that the fact that /dev/null is fewer characters than /etc/mnttab is important here; if it were longer it might not fit into the buffer vmtoolsd has allocated to hold the string it is passing to open(2).

Clearly this is a sub-optimal solution, but it seems to be working for now. To confirm, we checked the output of:

dtrace -n 'syscall::stat:entry /execname == "vmtoolsd"/ { printf("%s\n", copyinstr(arg0)); }'

which no longer shows our file system mount points. It looks like the latest development snapshots of open-vm-tools don't have this problem, so this will keep our unit tests happy until those changes get pulled into the official VMware tools, or we get around to building the development snapshot ourselves.