A few weeks ago, Matt and I integrated OpenZFS device removal into the Delphix downstream repository of Illumos. I first presented this work at the OpenZFS Developer Summit; this blog post provides a high-level overview of the feature and explores some of the more interesting challenges we faced.
What is device removal?
Device removal allows an administrator to remove top-level disks from a storage pool (mechanisms already exist for removing log and cache devices and for detaching disks from a mirror). This feature is useful in a number of cases, ranging from "Oops, I meant to attach that disk as a mirror" to "I don't need as much storage capacity in this pool as I thought." An example of device removal:
```
# zpool remove test c2t1d0
# zpool status -v test
  pool: test
 state: ONLINE
  scan: none requested
remove: Evacuation of vdev 1 in progress since Mon Dec 10 08:06:43 2014
        340M copied out of 405M at 67.5M/s, 83.90% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t1d0    ONLINE       0     0     0
          c2t2d0    ONLINE       0     0     0
          c2t3d0    ONLINE       0     0     0
```
How does device removal work?
At a high level, device removal takes 3 steps:
- Disable allocations to the old disk
- Copy data from old disk to other disks in the pool
- Remove device from the pool
During this process, the challenge is keeping track of the copied data. The naive approach of building a map from block pointers (BPs) on the old device to BPs on the new device quickly requires an unreasonably large data structure that must be stored on disk and kept in memory the entire time it is in use. Even a technique like the now-defunct BP rewrite project would require a huge temporary table (and brings its own set of problems). We needed an approach that would allow a layer of indirection but use less memory. Ultimately, we decided to construct the mapping table at a lower layer: in the SPA. At this layer, we can ignore block boundaries and instead manage contiguously allocated segments on the old disk: the table maps an offset and length on the old disk to a new offset on another disk in the pool. For very fragmented disks this could still require a lot of memory[1], but in the general case it is a huge reduction in the number of entries in the indirect mapping table. Another benefit of this approach is that we can construct the mapping table by iterating through the allocated segments on the old disk in LBA order, which also lets us read from the old disk in LBA order.
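To make the segment-based table concrete, here is a minimal sketch of an indirect mapping in Python. The class and method names are hypothetical (the real structure lives in the SPA and is written in C); the point is that each entry covers a contiguous segment, so a single lookup may return multiple pieces when a block straddles entry boundaries.

```python
import bisect

class IndirectMapping:
    """Toy model of the segment-based indirect mapping (hypothetical names).

    Each entry maps a contiguous segment on the removed disk
    (old_offset, length) to a new location (new_vdev, new_offset).
    Entries are kept sorted by old_offset so lookups can bisect.
    """

    def __init__(self):
        self.entries = []  # (old_offset, length, new_vdev, new_offset)

    def add(self, old_offset, length, new_vdev, new_offset):
        bisect.insort(self.entries, (old_offset, length, new_vdev, new_offset))

    def lookup(self, offset, length):
        """Translate a read of [offset, offset+length) on the old disk.

        Returns a list of (new_vdev, new_offset, piece_length); more than
        one piece means the range was split across mapping entries.
        """
        pieces = []
        # Find the entry whose segment contains `offset`.
        i = bisect.bisect_right(self.entries, (offset, float("inf"))) - 1
        while length > 0:
            old_off, seg_len, vdev, new_off = self.entries[i]
            delta = offset - old_off            # position within this segment
            run = min(length, seg_len - delta)  # bytes served by this entry
            pieces.append((vdev, new_off + delta, run))
            offset += run
            length -= run
            i += 1
        return pieces
```

For example, if segments [0, 100) and [100, 150) of the old disk were copied to different places, a read of [90, 110) translates to two pieces, one per entry.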
Open context removal
When iterating through the segments on the disk, we did not use the existing scan framework, for two main reasons: the existing scan code for scrub and resilver operates at too high a layer (it iterates through BPs in the DMU, not allocated segments in the SPA), and it has confusing performance characteristics[2]. Instead, we developed a new approach that uses a background thread running in open context to do most of the work, with sync tasks to update the on-disk data structures. Every txg, the open context thread will:
- Iterate over a number of allocated segments
- Allocate new locations for those segments on other disks in the pool. If there is no empty region anywhere in the pool large enough for a segment (e.g. we find a 4GB contiguously allocated segment on the old disk), the segment is split into smaller pieces until the allocation succeeds.
- Copy the data to the new locations
- Issue a sync task that updates the indirect mapping table with entries for the newly copied segments and records the progress made this txg in the MOS
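The per-txg loop above might be sketched as follows. This is a simplified illustration with hypothetical helper names (`allocated_segments`, `pool.allocate`, and so on); the real code issues ZIO-level I/O and commits its state via a sync task.

```python
def evacuate_txg(old_disk, pool, max_bytes, start_offset):
    """One txg's worth of evacuation (simplified sketch, hypothetical names).

    Walks allocated segments on the old disk in LBA order, allocates new
    space, copies the data, and returns the mapping entries to be committed
    by a sync task along with the resume offset for the next txg.
    """
    new_entries = []
    copied = 0
    offset = start_offset
    for seg_offset, seg_len in old_disk.allocated_segments(from_offset=offset):
        if copied >= max_bytes:
            break
        # Split the segment until an allocation of that size succeeds
        # (e.g. a 4GB segment with no 4GB hole anywhere in the pool).
        remaining = [(seg_offset, seg_len)]
        while remaining:
            off, length = remaining.pop()
            loc = pool.allocate(length)          # -> (vdev, new_offset) or None
            if loc is None:
                half = length // 2
                remaining.append((off + half, length - half))
                remaining.append((off, half))
                continue
            vdev, new_off = loc
            pool.write(vdev, new_off, old_disk.read(off, length))
            new_entries.append((off, length, vdev, new_off))
            copied += length
        offset = seg_offset + seg_len
    # A sync task would add new_entries to the indirect mapping and record
    # `offset` (our progress) in the MOS.
    return new_entries, offset
```

Note how a segment that cannot be allocated in one piece is halved repeatedly until allocation succeeds, matching the splitting behavior described above.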
Unfortunately, this process can lead to "split blocks": since we do not know where the DMU block boundaries are when iterating through allocated segments, we might create a split in the middle of a DMU block. As a consequence, we must piece the data for these blocks back together when reading or freeing them after the removal has completed.
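Reassembling a split block amounts to issuing one child read per mapping entry that overlaps the block and concatenating the results. A hedged sketch, where `mapping` and `read_at` are hypothetical stand-ins for the indirect mapping and the underlying vdev read:

```python
def read_split_block(offset, length, mapping, read_at):
    """Reassemble a DMU block whose data was split across mapping entries.

    `mapping` is a list of (old_offset, length, vdev, new_offset) entries
    sorted by old_offset, and `read_at(vdev, offset, length)` performs the
    actual I/O (both hypothetical, for illustration).
    """
    data = b""
    for old_off, seg_len, vdev, new_off in mapping:
        # Portion of the requested range covered by this entry.
        lo = max(offset, old_off)
        hi = min(offset + length, old_off + seg_len)
        if lo < hi:
            data += read_at(vdev, new_off + (lo - old_off), hi - lo)
    return data
```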
Frees during a removal
In order to know that we have copied every segment off the old disk, we have to be careful not to change which segments are allocated on it while we are iterating over it. The set of segments on a disk changes when new segments are allocated to it or existing segments are freed from it. Although we can disable new allocations to the disk during a removal, frees are more complicated: we cannot simply disable or delay frees without breaking the space accounting in the DMU. Instead, we allow frees to the old disk to proceed and carefully handle their interaction with the in-progress removal. We always free from the old disk, and then handle one (or more, in the case of split blocks) of the following three cases, depending on the progress of the removal thread:
- If the free is to a segment that has already been mapped, we free from the new location in addition to the old location.
- If the free is to a segment that has not yet been mapped, our free from the old location will cause us to never create a mapping for it.
- If the free is to a segment that is currently being mapped (i.e. we are in the process of copying the data), we delay the free to the new location until the sync task after the data has been copied.
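The three cases can be sketched as a single dispatch function. All names here are hypothetical (the real logic lives in the SPA); `delayed_frees` stands in for the list of frees that must wait for the sync task that commits the in-flight mapping.

```python
def free_during_removal(offset, length, mapping, copy_in_progress, old_disk,
                        free_at, delayed_frees):
    """Sketch of the three cases for a free issued during removal.

    `mapping` holds already-copied segments; `copy_in_progress` is the set
    of segments currently being copied this txg; `delayed_frees` collects
    frees that must wait for the next sync task.
    """
    # We always free from the old location. For a not-yet-copied segment
    # (case 2) this is sufficient: the removal thread will simply never
    # create a mapping entry for it.
    old_disk.free(offset, length)

    entry = mapping.get((offset, length))
    if entry is not None:
        # Case 1: already mapped -> also free the new location.
        vdev, new_offset = entry
        free_at(vdev, new_offset, length)
    elif (offset, length) in copy_in_progress:
        # Case 3: being copied right now -> delay the free of the new
        # location until the sync task after the copy completes.
        delayed_frees.append((offset, length))
```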
This work is currently undergoing testing on our internal fork of Illumos. We're actively working on a couple more features before we're ready to put the code up for general review:
- We want to reduce the memory overhead of a removed device. We're currently looking into rewriting indirect blocks to remove references to the old disk, both proactively and when an indirect block is rewritten as part of normal writes. In addition, we would like to provide a mechanism to delete unused sections of the mapping.
- Currently, device removal only works if all top-level vdevs are plain disks or files. At the very least, we want device removal to interact intelligently with mirrored disks.
We'll publish the code when we ship our next release, probably in March, but we won't integrate it into Illumos until we address all of the future work items above.
[1] There is a future improvement that would make the very fragmented case much better: we could map an entire large (say, 128K) region of the old disk to a new disk, then create holes in the new region so that it is fragmented in exactly the same way as the old region. That way, a single mapping entry could represent a region containing many segments.
[2] By time-slicing disk activity between async writes and scrubs, neither can achieve the maximum possible throughput.
Writeup by Alex Reece, see me on Google+.