OpenZFS Code Walk: Metaslabs and Space Maps
I wrote this post about a year ago while trying to sort out how metaslabs worked with George Wilson. At the time, I found walking through the code with George to be exceptionally helpful, so I'm publishing this in case others find it useful. In OpenZFS, metaslabs are data structures used by the spa (the storage pool allocator) to track space allocations. In particular, each vdev is split into ~200 regions (vdev_metaslab_set_size), and each region is tracked by one metaslab. The bookkeeping information for the allocator is represented in-memory as a collection of range trees and on-disk as a space map.
A space map is a fairly straightforward log of the free / allocated space for a region. Records are appended to the space map as space is freed / allocated from the region (with the last block being COWed as necessary). This occurs in metaslab_sync, which performs a space_map_write of the allocated space and then of the free space. Over time, the long log of appended records makes the space map an inefficient use of space, since many records cancel each other out. If metaslab_should_condense determines that a space map is too inefficient, then metaslab_condense will rewrite the whole space map rather than merely appending new records.
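To make the "log that grows until it is condensed" idea concrete, here is a minimal Python sketch. It is a toy model, not how OpenZFS implements space maps: the names SpaceMapSketch, replay_free, and condense are hypothetical, records are plain (type, offset, length) tuples, and condensing is done by replaying the log, which I am assuming only for illustration.

```python
from dataclasses import dataclass, field

ALLOC, FREE = "A", "F"


@dataclass
class SpaceMapSketch:
    """Toy append-only log of (type, offset, length) records for one region."""
    records: list = field(default_factory=list)

    def write(self, kind, segments):
        # Append only: existing records are never edited in place.
        for offset, length in segments:
            self.records.append((kind, offset, length))

    def replay_free(self, region_size):
        """Replay the log to recover the set of free units in the region."""
        free = set(range(region_size))
        for kind, offset, length in self.records:
            units = set(range(offset, offset + length))
            free = free - units if kind == ALLOC else free | units
        return free

    def condense(self, region_size):
        """Rewrite the log as the shortest list of allocated segments."""
        free = self.replay_free(region_size)
        condensed, start = [], None
        # Iterate one past the end so a trailing allocated run is closed.
        for unit in range(region_size + 1):
            allocated = unit < region_size and unit not in free
            if allocated and start is None:
                start = unit
            elif not allocated and start is not None:
                condensed.append((ALLOC, start, unit - start))
                start = None
        self.records = condensed
```

Replaying the log before and after condense yields the same free set; only the number of records changes.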
ms_tree and ms_size_tree
At the heart of the metaslab is ms_tree, a range tree representing the free space that is allocatable. This range tree is mirrored by ms_size_tree, which stores the same free segments in a different order: the ms_tree orders them by address, while the ms_size_tree orders them by size. In addition, there are 3 other trees used to track the space in the metaslab: the ms_alloctree, the ms_freetree, and the ms_defertree. Unfortunately, this is where the fact that this is a mature, optimized piece of systems software can obscure what is going on. Before we continue, it is useful to briefly discuss transaction groups in ZFS.
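As an aside on the two orderings of ms_tree and ms_size_tree, the sketch below keeps one set of free segments in two sorted lists. Everything here is hypothetical (the class FreeSegmentsSketch and its methods are not OpenZFS names), it uses sorted lists rather than real trees, and it does not merge adjacent segments the way a range tree would. The lookups are only my guess at why both orderings are handy.

```python
import bisect


class FreeSegmentsSketch:
    """One set of free segments, kept in two sort orders."""

    def __init__(self):
        self.by_offset = []  # (offset, size), like ms_tree
        self.by_size = []    # (size, offset), like ms_size_tree

    def add(self, offset, size):
        # Every segment is inserted into both orderings.
        bisect.insort(self.by_offset, (offset, size))
        bisect.insort(self.by_size, (size, offset))

    def remove(self, offset, size):
        self.by_offset.remove((offset, size))
        self.by_size.remove((size, offset))

    def largest(self):
        """Return (size, offset) of the largest free segment, or None."""
        return self.by_size[-1] if self.by_size else None

    def smallest_fit(self, want):
        """Return (offset, size) of the smallest segment with size >= want."""
        i = bisect.bisect_left(self.by_size, (want, -1))
        if i == len(self.by_size):
            return None
        size, offset = self.by_size[i]
        return (offset, size)

    def first_at_or_after(self, offset):
        """Return the first (offset, size) segment starting at or after offset."""
        i = bisect.bisect_left(self.by_offset, (offset, -1))
        return self.by_offset[i] if i < len(self.by_offset) else None
```

Size-based questions ("is there any segment big enough?") are answered from one list and address-based questions from the other, at the cost of updating both on every change.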
All write zios are assigned to a transaction group (txg) when they are created. There are 3 active transaction groups at all times: one in each of open context, quiescing context, and syncing context. In open context, space is reserved for the operation out of the dsl dir for the corresponding dataset and the ZIL writes out its intent to perform IOs. In syncing context, we write out all blocks that have changed (in spa_sync). This will probably cause the allocation of additional metadata blocks, dirtying more metaslabs, which in turn need to be synced out to their respective space maps, etc. We thus must write out in iterative passes until eventually we "converge" by having no more dirty blocks to write. To help ensure convergence, there are several optimizations we perform:
- After zfs_sync_pass_deferred_free passes, we do not write out space that is freed. Instead, we enqueue it on spa_deferred_bpobj to be freed in the next txg.
ms_alloctree and ms_freetree
The ms_alloctree and the ms_freetree contain all the blocks that were allocated / freed this sync pass. Unfortunately, we need to have a parallel copy of each of these trees for each possible txg state. This is because not all allocations / frees are performed in syncing context: the ZIL has the ability to perform writes and allocations outside of syncing context. At the end of every sync pass, all dirty metaslabs are synced via metaslab_sync.
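The "parallel copy per txg state" arrangement can be sketched as arrays of trees indexed by the txg number. This is a toy model under stated assumptions: I am assuming 4 slots selected with txg & TXG_MASK, sets of block ids stand in for range trees, and MetaslabTreesSketch with its methods is a hypothetical name.

```python
TXG_SIZE = 4             # assumed number of per-txg slots
TXG_MASK = TXG_SIZE - 1  # assumed mask for picking a slot from a txg number


class MetaslabTreesSketch:
    """Per-txg alloc/free bookkeeping, with sets standing in for range trees."""

    def __init__(self):
        self.ms_alloctree = [set() for _ in range(TXG_SIZE)]
        self.ms_freetree = [set() for _ in range(TXG_SIZE)]

    def alloc(self, txg, block):
        # Record the allocation against the txg it belongs to.
        self.ms_alloctree[txg & TXG_MASK].add(block)

    def free(self, txg, block):
        self.ms_freetree[txg & TXG_MASK].add(block)

    def sync(self, txg):
        """Return this txg's records (as if written out) and vacate its slots."""
        slot = txg & TXG_MASK
        written = (sorted(self.ms_alloctree[slot]), sorted(self.ms_freetree[slot]))
        self.ms_alloctree[slot].clear()
        self.ms_freetree[slot].clear()
        return written
```

Because each txg has its own slot, an allocation made by the ZIL in the open txg does not disturb the records being written for the syncing txg.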
In metaslab_sync, all dirty blocks from ms_alloctree and ms_freetree for the current (syncing) txg are written to the associated space map and then the trees are vacated. Hence, the ms_alloctree and ms_freetree for the syncing txg are empty at the end of a sync pass.
ms_defertree
The ms_defertree is where things get a bit complicated. Our goal is to be able to roll back up to TXG_DEFER_SIZE txgs if necessary. To do this, we need to ensure that blocks that are freed in txg n are not actually available to be allocated until txg n+TXG_DEFER_SIZE.
We ensure this by saving freed space in the ms_defertree for this txg rather than moving it directly into ms_tree after we sync the freed space to the space map. After TXG_DEFER_SIZE txgs, it is moved into ms_tree, where it becomes available for allocation. Since we don't make the space in the ms_defertree available until the end of a txg, we accumulate all blocks freed in the syncing txg into a "freed" tree.
Because nobody else is able to use it (we have at least 1 more slot for a txg than we have possible txg states, so it can't be in open context and the ZIL cannot be modifying it), we use the ms_freetree from the previous transaction group (TXG_CLEAN(txg)) for this freed tree. Note that the accounting of available space as represented on disk by the space map may not match the available space as represented in memory as the ms_tree of the metaslab.
Furthermore, we have a performance optimization that allows us to free blocks from a metaslab without reading it off disk (by merely appending the record to the space map). When doing so, the freed space is added to the metaslab's ms_defertree and is removed from the ms_tree when the metaslab is loaded in metaslab_load. Thus, the space in the ms_defertrees together with the space in the ms_tree matches the available space as represented by the space map.
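The load-time accounting at the end of this section can be sketched as follows. This is a simplified model, not the real metaslab_load: the function name metaslab_load_sketch is hypothetical, space map records are assumed to be (kind, block) pairs, and sets of block ids stand in for range trees.

```python
def metaslab_load_sketch(region_size, space_map_records, defer_trees):
    """Rebuild the allocatable set from the logged space map, minus deferred frees."""
    ms_tree = set(range(region_size))  # start with the whole region free
    for kind, block in space_map_records:
        if kind == "alloc":
            ms_tree.discard(block)
        else:  # "free"
            ms_tree.add(block)
    # Deferred blocks are free on disk but must not be handed out yet.
    for deferred in defer_trees:
        ms_tree -= deferred
    return ms_tree
```

For example, if block 2 was freed while the metaslab was not loaded, the space map already counts it as free, but after loading it lives only in a defer tree. The union of the returned ms_tree and the defer trees equals the free space recorded on disk.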
The life of a metaslab
The life of a metaslab is as follows:
- In open context, the ZIL can issue IOs, causing blocks to be allocated and freed.
- In syncing context:
  - The spa_deferred_bpobj from the previous txg is synced, causing writes to the space map and potential allocations and frees.
  - Every sync pass:
    - All dirty vdevs are written, causing blocks to be allocated and freed.
    - All metaslabs are synced:
      - If it is the first sync pass and the metaslab's space map is too inefficient, it is condensed. Otherwise, allocation / free records are appended to the space map.
      - The ms_alloctree is emptied.
      - The blocks in the ms_freetree are removed and added to the previous txg's ms_freetree.
  - ms_alloctree and ms_freetree are now empty, and the previous txg's ms_freetree contains the accumulation of all blocks freed this txg.
  - For each dirty metaslab, blocks in the ms_defertree are added to the ms_tree and the blocks in the previous txg's ms_freetree are moved to the ms_defertree.
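The steps above can be walked through in a small simulation. It is a sketch under loud assumptions rather than the real code: I am assuming TXG_SIZE = 4, TXG_DEFER_SIZE = 2, that TXG_CLEAN(txg) is txg - 1, and that the defer tree is picked with txg % TXG_DEFER_SIZE. MetaslabSketch, sync_pass, and sync_done are hypothetical names, condensing is left out, and sets of block ids stand in for range trees.

```python
TXG_SIZE = 4        # assumed number of per-txg slots
TXG_MASK = TXG_SIZE - 1
TXG_DEFER_SIZE = 2  # assumed number of txgs a free is deferred


def txg_clean(txg):
    # Assumption: the "clean" txg is simply the previous one.
    return txg - 1


class MetaslabSketch:
    def __init__(self, blocks):
        self.ms_tree = set(blocks)  # allocatable free space
        self.ms_alloctree = [set() for _ in range(TXG_SIZE)]
        self.ms_freetree = [set() for _ in range(TXG_SIZE)]
        self.ms_defertree = [set() for _ in range(TXG_DEFER_SIZE)]
        self.space_map = []         # append-only log of records

    def alloc(self, txg):
        block = min(self.ms_tree)
        self.ms_tree.remove(block)
        self.ms_alloctree[txg & TXG_MASK].add(block)
        return block

    def free(self, txg, block):
        self.ms_freetree[txg & TXG_MASK].add(block)

    def sync_pass(self, txg):
        """One sync pass: log the records, then vacate this txg's trees."""
        allocs = self.ms_alloctree[txg & TXG_MASK]
        frees = self.ms_freetree[txg & TXG_MASK]
        self.space_map += [("alloc", b) for b in sorted(allocs)]
        self.space_map += [("free", b) for b in sorted(frees)]
        # Accumulate this pass's frees in the previous txg's (idle) freetree.
        self.ms_freetree[txg_clean(txg) & TXG_MASK] |= frees
        allocs.clear()
        frees.clear()

    def sync_done(self, txg):
        """End of txg: release old deferred frees, then defer this txg's frees."""
        freed = self.ms_freetree[txg_clean(txg) & TXG_MASK]
        defer = self.ms_defertree[txg % TXG_DEFER_SIZE]
        self.ms_tree |= defer
        defer.clear()
        defer |= freed
        freed.clear()
```

In this model a block freed in txg n only reappears in ms_tree when sync_done runs for txg n + TXG_DEFER_SIZE, which is the deferral property described earlier.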
Writeup by Alex Reece, see me on Google+.