
ZFS Channel Programs


This blog post is by Chris Williamson, who owned delivery of the ZFS channel programs feature at Delphix.

Motivation

In our work on ZFS at Delphix, we've found ourselves frequently needing to implement new ZFS administrative commands and add features to existing ones.

For example, "zfs destroy" provides several different mechanisms for destroying datasets: you can specify a single filesystem, with optional flags to recursively destroy its children or dependents, or you can specify a list of snapshots. This is occasionally limiting, since we may want to destroy some collection of datasets which aren't snapshots and aren't all descendants of the same root filesystem. If we want to perform this operation, it's necessary to either use an external program or script to call "zfs destroy" individually on each filesystem, or to add a new flag to the CLI and hard-code the particular behavior we want.

Similarly, we often discover use cases which don't justify entirely new ZFS administrative operations, but do need to combine several existing operations into one compound one. One such instance we ran into recently at Delphix requires calling "zfs list" and "zfs snapshot" simultaneously, where the filesystem list and snapshot need to be consistent with one another. Ensuring these two commands both see the ZFS pool in the same state would require implementing a new command which invokes both of them.

In general, we're beginning to see a pattern of adding new calls which are simply compound operations of multiple others, or an existing operation iterating over a particular list of datasets instead of a single one. However, because many of these implementations require slightly different behaviors and error handling, it's often difficult for them to share code, resulting in a lot of repetitive re-implementation of similar routines.

Enter channel programs

The idea of a "channel program" (a holdover term from the days of the IBM mainframe) provides a natural common interface for scripted combinations of these operations. Rather than hardcoding many slightly different sequences of commands to be exposed to a user, we provide a script-based interface, where a user can write a Lua script which makes calls (through a library we provide) to ZFS. These channel program scripts can easily express whatever iteration, error handling, or other combination of ZFS operations a particular task requires.

Background: ZFS Transaction Groups

We'll dive more into specific implementation details in a later post, but some background on how ZFS organizes filesystem state changes is helpful for understanding where channel programs are useful. Adam Leventhal has written a thorough explanation of how this system works, but the main relevant details are:

  • Filesystem changes are batched into groups of transactions, which alternate between open (registering and executing new transactions) and syncing (applying and writing out changes) states.

  • Some administrative operations (e.g. destroying a snapshot) must be performed in syncing context. We generally refer to these operations as synctasks.

  • Time between transaction group syncs can vary widely, from a few milliseconds to several seconds on high-load systems.

Using channel programs, a series of administrative operations can be executed entirely during a single synctask. Otherwise, destroying a specific list of datasets would require a transaction group sync to finish for each deletion. Similarly, by executing an entire sequence of changes as one synctask, we can ensure that the sequence occurs atomically and is not interrupted by any other change.

Choosing a language

Lua proved a good choice for a language to use for channel program scripts for a number of reasons. Despite some linguistic quirks (1-indexed arrays, lack of distinction between undefined and "nil" values), the language is very small, approachable, and easy to sandbox and embed in existing programs. Adding library functions and hooks is relatively straightforward, and the interpreter itself is highly configurable.

It was necessary to modify a number of parts of the Lua interpreter to suit our needs. Since the interpreter is running in the kernel, performing file I/O and spawning processes are off the table, so a number of Lua standard library functions had to be disabled. The interpreter's error handling also needed tweaking - when running on its own, Lua responding to a fatal error by panicking and exiting is reasonable behavior, but in our usage we need to safely return from the channel program invocation, ideally with a helpful error message. Lua also uses floats for numbers by default, which isn't terribly useful for any of the numerical values a channel program is expected to deal with, so these have been changed to 64-bit integers.

A note on security

Given that this feature adds the ability to run arbitrary scripts in kernel mode, security is a potential issue, so executing channel programs requires the SYS_CONFIG privilege (the same privilege required for other ZFS operations which create and destroy datasets). Furthermore, we've added Lua memory and instruction count limits to prevent poorly-written or malicious Lua scripts from consuming all the memory in the kernel or causing a transaction group sync to block forever. Even though we've done a fair bit of work to ensure that a ZCP script can't total the rest of the kernel, for now we're also restricting its use to root in the global zone.

OK, OK, how do I use it?

There are two ways to invoke a channel program as a user: via the ZFS command line with the "zfs program" command, and through LibZFS. The CLI interface is covered here--for details on the few points where the LibZFS interface differs, as well as for everything below, see the zfs-program(1M) man page on a system with a recent build of ZFS.

For a general reference for the Lua language for writing ZCP scripts, the Lua 5.2 reference manual is available, and numerous tutorials exist for those wanting to get acquainted with Lua from the ground up. Regardless, I'll try to point out any syntactic oddities below as they crop up.

The basic invocation of a channel program generally looks something like this:

# zfs program rpool script.zcp $script_argument

Any extra arguments provided will be accessible from the ZCP script.

By default, a channel program will be automatically stopped if it runs longer than 10 million Lua instructions (Lua counts roughly 1 "instruction" per expression in the program) or uses more than 10MB of memory, but larger limits can be specified with the "-t" and "-m" flags.
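
To get a feel for what an instruction limit means, the idea can be modelled in stock Lua. This is only a sketch: it uses the "debug" library from a standalone Lua interpreter, which is not how the kernel enforces the limit, and "run_with_limit" is a hypothetical helper, not a ZFS interface.

```lua
-- Sketch only: models an instruction budget with stock Lua's debug library.
-- run_with_limit is a hypothetical helper, not part of any ZFS interface.
function run_with_limit(fn, max_instructions)
    local function on_budget_exhausted()
        error("instruction limit exceeded")
    end
    -- An empty mask plus a count makes the hook fire after every
    -- max_instructions VM instructions.
    debug.sethook(on_budget_exhausted, "", max_instructions)
    local ok, result = pcall(fn)
    debug.sethook() -- clear the hook again
    return ok, result
end

-- A short loop finishes well inside the budget.
ok_small, sum = run_with_limit(function()
    local total = 0
    for i = 1, 10 do
        total = total + i
    end
    return total
end, 100000)

-- A long loop is cut off once the budget is used up.
ok_long, msg = run_with_limit(function()
    local n = 0
    for i = 1, 10000000 do
        n = n + 1
    end
    return n
end, 1000)
```

Here "ok_small" is true, while "ok_long" is false and "msg" carries the error text, which is roughly the experience of a script that runs past its limit.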

Input and output

Any extra command line arguments passed to the "zfs program" command will be passed as strings to the channel program script. These are accessible in the script through Lua's variable-length argument syntax:

args = ...
argv = args['argv']
assert(argv[1] == "scriptargument")
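
The same "..." mechanism exists in stock Lua, where every chunk receives its arguments as varargs. The following sketch imitates the argument table outside of ZFS, assuming Lua 5.2 or later (where "load" accepts a string); the table passed to the chunk is a hand-built stand-in for what "zfs program" would provide.

```lua
-- Sketch only: in stock Lua a chunk also receives its arguments through
-- "...", so we can imitate how a script sees the argument table.
script_source = [[
    args = ...
    argv = args["argv"]
    return { first = argv[1], count = #argv }
]]

script = assert(load(script_source))

-- Hypothetical stand-in for: zfs program rpool script.zcp first second
result = script({ argv = { "first", "second" } })
```

After this runs, "result.first" is "first" and "result.count" is 2. Note that "argv" starts at index 1, as Lua tables do.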

Values may also be returned from a ZCP script. Unlike other Lua functions, where multiple return values are permitted, only a single value may be returned. The return value of the ZCP script will be formatted and printed out when the script exits. So "return 1" at the top level gives:

Channel program fully executed with return value:
    return: 1

If you need to return multiple values, a Lua table can be used. For example, "return {foo=12, bar="baz"}" results in:

Channel program fully executed with return value:
    return:
        bar: 'baz'
        foo: 12

Note that the ordering of the returned table values is not defined.
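
The undefined ordering comes from Lua itself: iterating a table with "pairs" gives no order guarantee. If a script needs a stable order internally (for example, when building a message), one option is to sort the keys first. A minimal sketch, assuming all keys are strings; "sorted_keys" is a hypothetical helper, not part of the "zfs" module:

```lua
-- Sketch only: sorted_keys is a hypothetical helper, not part of the zfs module.
-- It assumes every key is a string, since table.sort cannot compare mixed types.
function sorted_keys(tbl)
    local keys = {}
    for key in pairs(tbl) do
        table.insert(keys, key)
    end
    table.sort(keys)
    return keys
end

result = { foo = 12, bar = "baz" }

-- Walk the table in sorted key order and build one line per entry.
lines = {}
for _, key in ipairs(sorted_keys(result)) do
    table.insert(lines, key .. ": " .. tostring(result[key]))
end
```

This only affects iteration inside the script; it says nothing about the order in which "zfs program" prints a returned table.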

Provided libraries

In addition to most of the base Lua built-in functions and the "coroutine", "string", and "table" modules, we've added the "zfs" module, which contains all ZFS functions accessible to a ZCP script. It contains several submodules:

  • At the top level: "zfs.debug()" and "zfs.get_prop()" allow printing messages to the zfs_dbgmsg log and retrieving properties, respectively.

  • "zfs.list" contains iterator functions for looping over collections of ZFS datasets. For example, performing some operation on all snapshots of some filesystem could be done like so:

    for snap in zfs.list.snapshots("rpool/myfs") do
      -- do something here
    end

  • "zfs.sync" contains all synctask functions which can modify the state of the pool. For example, "zfs.sync.destroy()" destroys a filesystem. These functions all return an error code on failure, or 0 on success.

    Some "zfs.sync" functions, such as "zfs.sync.promote()", may also return a Lua table as a second return value in some error conditions, giving additional details about the error. These functions can be called capturing either only the error code or both return values. If there are no additional details to report, the second value is nil. So all of the following invocations are valid:

    zfs.sync.promote("rpool/clone")
    err = zfs.sync.promote("rpool/clone")
    err, conflicting_snapshots = zfs.sync.promote("rpool/clone")

    Any error codes returned from zfs have global aliases defined for use in ZCP scripts, so for example, "if (err == EPERM)" is valid.

  • "zfs.check" contains the same set of functions as "zfs.sync", but each one only does a dry run of the corresponding "zfs.sync" operation. This can be used to gracefully handle any errors which may occur in a ZCP script. A common pattern is something like the following:

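    args = ...
    datasets = args["argv"]
    -- dry-run every destroy first; Lua spells "not equal" as ~=
    for _, ds in ipairs(datasets) do
        if (zfs.check.destroy(ds) ~= 0) then
            return -1
        end
    end
    -- all checks passed, so perform the real destroys
    for _, ds in ipairs(datasets) do
        assert(zfs.sync.destroy(ds) == 0)
    end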

    This is not foolproof, since the state of the pool can change between corresponding "zfs.check" and "zfs.sync" calls, but it does help prevent common cases where a series of operations could fail partway through.
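
To experiment with the check-then-sync pattern outside the kernel, the "zfs" module can be replaced by a small in-memory stand-in. Everything about the stand-in below is an assumption made for illustration: the dataset names, the error numbers, and the rule for when a destroy fails. Only the calling convention (0 on success, an error code otherwise, optional second return value left as nil) follows the description above.

```lua
-- Sketch only: a tiny in-memory stand-in for the zfs module so the pattern can
-- be tried in stock Lua. Dataset names, error numbers and stub behavior are
-- assumptions made for illustration.
ENOENT = 2
EBUSY = 16

existing = { ["rpool/a"] = true, ["rpool/b"] = true }
busy = { ["rpool/b"] = true }

local function check_destroy(ds)
    if not existing[ds] then return ENOENT end
    if busy[ds] then return EBUSY end
    return 0
end

zfs = {
    check = { destroy = check_destroy },
    sync = {
        destroy = function(ds)
            local err = check_destroy(ds)
            if err == 0 then existing[ds] = nil end
            return err
        end,
    },
}

-- Destroy every dataset in the list, or none of them.
function destroy_all(datasets)
    for _, ds in ipairs(datasets) do
        local err = zfs.check.destroy(ds)
        if err ~= 0 then
            return err, ds -- report which dataset blocked the batch
        end
    end
    for _, ds in ipairs(datasets) do
        assert(zfs.sync.destroy(ds) == 0)
    end
    return 0
end

-- rpool/b is busy, so rpool/a must be left untouched as well.
err, culprit = destroy_all({ "rpool/a", "rpool/b" })
still_there = existing["rpool/a"]

-- Once rpool/b is no longer busy, the whole batch goes through.
busy["rpool/b"] = nil
err_retry, culprit_retry = destroy_all({ "rpool/a", "rpool/b" })
```

The first call returns EBUSY together with the offending name and changes nothing; the second returns 0, and "culprit_retry" is nil because there is no second value to report.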

Writing a channel program

Let's say we wanted to write a channel program to do batch deletion of a number of datasets. One very simple way to do this would be to pass the datasets to be destroyed as arguments:

args = ...
to_destroy = args["argv"]
-- argv is a plain Lua table, so walk it with ipairs
for _, ds in ipairs(to_destroy) do
    zfs.sync.destroy(ds)
end

This works for simple cases, but has a number of problems. Having to pass every dataset on the command line is awkward, and we haven't done any error handling, so the script could silently fail to destroy any number of the datasets provided. Let's tackle these problems one at a time.

First, we can reduce to a single argument and replicate the behavior of "zfs destroy -R", recursively destroying all dependents of the dataset:

args = ...
destroy_root = args["argv"][1]

function destroy_recursive(root)
    for child in zfs.list.children(root) do
        destroy_recursive(child)
    end
    for snap in zfs.list.snapshots(root) do
        for clone in zfs.list.clones(snap) do
            destroy_recursive(clone)
        end
        zfs.sync.destroy(snap)
    end
    zfs.sync.destroy(root)
end

destroy_recursive(destroy_root)

This enforces the correct deletion order (children first), and is easier to use, but our other error handling problems remain. Say we want to try to make sure that our script is all-or-nothing, and will either fail or successfully destroy everything. We can accomplish this by first checking whether each destruction will succeed, and if any of them would fail, exiting before making any changes.

This is made slightly more complicated by the fact that attempting to destroy a dataset with dependents will return an error, but that same destroy operation may become valid after we destroy the dependents. Since in our case we always destroy all dependents of each dataset, we can safely ignore this error (ECHILD). If we didn't have this guarantee, we could still take an extra step to verify that any dependents of a dataset to be destroyed appeared earlier in the list.

args = ...
destroy_root = args["argv"][1]

-- recursively build the list of datasets to be destroyed, dependents first
function gather_destroy(root, to_destroy)
    for child in zfs.list.children(root) do
        to_destroy = gather_destroy(child, to_destroy)
    end
    for snap in zfs.list.snapshots(root) do
        for clone in zfs.list.clones(snap) do
            to_destroy = gather_destroy(clone, to_destroy)
        end
        table.insert(to_destroy, snap)
    end
    table.insert(to_destroy, root)
    return to_destroy
end

datasets = gather_destroy(destroy_root, {})

-- pre-check all of our destroy operations to see if any will fail
for _, ds in ipairs(datasets) do
    err = zfs.check.destroy(ds)
    if (err ~= 0 and err ~= ECHILD) then
        error("failed to destroy " .. ds .. " errno: " .. err)
    end
end
-- we're safe, actually destroy the datasets
for _, ds in ipairs(datasets) do
    assert(zfs.sync.destroy(ds) == 0)
end

Now, this channel program is effectively all-or-nothing: it either destroys every dataset in the list or fails before making any changes, without leaving any partially-completed state.
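
The extra verification step mentioned earlier, checking that every dependent of a dataset appears before it in the list, is not needed for this script, but it could look something like the sketch below. "dependents_first" and the hand-written "dependents_of" table are hypothetical; in a real channel program the dependents would have to be gathered through "zfs.list".

```lua
-- Sketch only: dependents_first is a hypothetical helper. dependents_of maps a
-- dataset name to the names that must be destroyed before it.
function dependents_first(to_destroy, dependents_of)
    local seen = {}
    for _, ds in ipairs(to_destroy) do
        for _, dep in ipairs(dependents_of[ds] or {}) do
            if not seen[dep] then
                return false, ds -- ds is listed before one of its dependents
            end
        end
        seen[ds] = true
    end
    return true
end

-- Made-up example hierarchy: a filesystem with a child and a cloned snapshot.
dependents_of = {
    ["rpool/fs"] = { "rpool/fs/child", "rpool/fs@snap" },
    ["rpool/fs@snap"] = { "rpool/clone" },
}

good_order = { "rpool/fs/child", "rpool/clone", "rpool/fs@snap", "rpool/fs" }
bad_order = { "rpool/fs@snap", "rpool/clone", "rpool/fs/child", "rpool/fs" }

ok_good = dependents_first(good_order, dependents_of)
ok_bad, offender = dependents_first(bad_order, dependents_of)
```

With such a check in place, the ECHILD error would only be ignored when "dependents_first" confirms that the list really is ordered dependents first.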

Finally, let's make this ZCP script a little more flexible. Maybe we'd like to be able to mark certain filesystems as temporary and have this script run periodically to destroy them. This can be done by checking for a user property and running the above destroy operation if it's found. This gives us our final script:

args = ...
pool = args["argv"][1]

function gather_destroy(root, to_destroy)
    for child in zfs.list.children(root) do
        to_destroy = gather_destroy(child, to_destroy)
    end
    for snap in zfs.list.snapshots(root) do
        for clone in zfs.list.clones(snap) do
            to_destroy = gather_destroy(clone, to_destroy)
        end
        table.insert(to_destroy, snap)
    end
    table.insert(to_destroy, root)
    return to_destroy
end

-- destroy root and everything that depends on it, all-or-nothing
function cleanup_dataset(root)
    datasets = gather_destroy(root, {})
    for _, ds in ipairs(datasets) do
        err = zfs.check.destroy(ds)
        if (err ~= 0 and err ~= ECHILD) then
            error("failed to destroy " .. ds .. " errno: " .. err)
        end
    end
    for _, ds in ipairs(datasets) do
        assert(zfs.sync.destroy(ds) == 0)
    end
end

function recursive_cleanup(root)
    for child in zfs.list.children(root) do
        recursive_cleanup(child)
    end
    -- We may encounter these clones when recursing through children of some
    -- other filesystem, but we catch them here as well to make sure each is
    -- destroyed before its origin fs.
    for snap in zfs.list.snapshots(root) do
        for clone in zfs.list.clones(snap) do
            recursive_cleanup(clone)
        end
    end
    -- Only recursively destroy the dataset if it's marked for destruction
    if (zfs.get_prop(root, "gc:tmp_cleanup") == "yes") then
        cleanup_dataset(root)
    end
end

recursive_cleanup(pool)

Feature status, future work

The "zfs program" command is about to be integrated into the OpenZFS repository. The existing library is relatively small -- it will be expanded upon as we develop this feature. The behavior of existing library calls is stable and not expected to change significantly. The operations that are currently implemented are:

  • zfs list (on filesystems, snapshots, and clones)
  • zfs get (as well as listing user/system properties)
  • zfs destroy
  • zfs promote

A number of additional features are already implemented, but we are waiting to upstream them until after the initial feature set lands in OpenZFS:

  • zfs snapshot
  • zfs rollback
  • zfs bookmarks
  • zfs holds
  • support for running read-only channel programs in open context

Other possible future work could include more fine-grained privilege checking, and additional operations such as:

  • zfs create
  • zfs set
  • zfs clone
  • zfs bookmark
  • zfs hold (and zfs release)

We hope you like the new channel programs interface!