LUN Expansion

A few months ago Adam Leventhal wrote about "metaslabs and growing vdevs", so I thought it appropriate to discuss how LUN expansion happens under the covers. As Adam mentioned, when space is added to a vdev the LUN is expanded by creating new metaslabs, but before this can happen there are several steps that must be performed. In this post I'll go into detail about how LUN expansion works; the focus will be on the manual operation, i.e. 'zpool online -e'. I should mention that LUN expansion requires EFI-labeled disks. It's possible to expand a VTOC-labeled device, but it's a tricky, manual process.

The first thing ZFS must do is relabel the existing device. It's the label that will ultimately be used to determine the new size of the device, so this must happen first. ZFS inspects the label, determines whether there is any space beyond the reserved partition of the EFI label, and if so adds that space to the last non-zero partition (see efi_use_whole_disk(3EXT) for more info). This happens in zpool_relabel_disk(). Once the EFI label on the device has been updated we're ready to bring that space online.
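
To make the relabel step concrete, here's a hedged user-space sketch loosely modeled on zpool_relabel_disk(); the helper name, path handling, and error handling are mine, not the actual libzfs code:

#include <fcntl.h>
#include <unistd.h>
#include <libefi.h>     /* efi_use_whole_disk(3EXT) */

static int
relabel_disk(const char *path)
{
        int fd, err;

        if ((fd = open(path, O_RDWR | O_NDELAY)) < 0)
                return (-1);

        /*
         * Rewrite the EFI label so any space beyond the reserved
         * partition is added to the last non-zero partition.
         */
        err = efi_use_whole_disk(fd);
        (void) close(fd);
        return (err);
}

Once the label has been rewritten, the kernel takes over in vdev_online():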

int
vdev_online(spa_t *spa, uint64_t guid, uint64_t flags, vdev_state_t *newstate)
{
<snip>
        vdev_reopen(tvd);
<snip>
        if ((flags & ZFS_ONLINE_EXPAND) || spa->spa_autoexpand) {

                /* XXX - L2ARC 1.0 does not support expansion */
                if (vd->vdev_aux)
                        return (spa_vdev_state_exit(spa, vd, ENOTSUP));
                spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
        }
<snip>
}
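
Stepping back to user space for a moment: the routine above is what 'zpool online -e' ultimately reaches. A hedged sketch of that entry point via libzfs (pool and device names are hypothetical, error handling is abbreviated):

#include <stdio.h>
#include <libzfs.h>

int
main(void)
{
        libzfs_handle_t *g = libzfs_init();
        zpool_handle_t *zhp;
        vdev_state_t newstate;

        if (g == NULL || (zhp = zpool_open(g, "tank")) == NULL)
                return (1);

        /* ZFS_ONLINE_EXPAND drives the relabel and reopen described here. */
        if (zpool_vdev_online(zhp, "c1t0d0", ZFS_ONLINE_EXPAND,
            &newstate) != 0)
                (void) fprintf(stderr, "expansion failed\n");

        zpool_close(zhp);
        libzfs_fini(g);
        return (0);
}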

Let's first look at the reopen logic to see what's happening and then we'll come back to the async logic that occurs afterwards. Recall that we've updated the label prior to calling the above routine. From the code above we know that we're only reopening a top-level vdev. Let's look at an example of the code flow for a mirrored top-level vdev before we dive in further:

vdev_reopen()
    vdev_open()
        vdev_mirror_open()
            vdev_open_children() --> vdev_open_child()
                                         vdev_open()
                                             vdev_disk_open()
                                 --> vdev_open_child()
                                         vdev_open()
                                             vdev_disk_open()

You can see from the code flow above that we start by opening the top-level vdev and work our way down to the leaves. So let's dive down to the bottom to see what happens at the leaf level:

static int
vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
    uint64_t *ashift)
{
<snip>
skip_open:
        /*
         * Determine the actual size of the device.
         */
        if (ldi_get_size(dvd->vd_lh, psize) != 0) {
                vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
                return (EINVAL);
        }
<snip>
}
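
The ldi_get_size() call above is how the kernel asks the device driver for the LUN's current size. A rough user-space analogue, offered as a hedged sketch rather than the actual kernel path:

#include <sys/types.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static int
device_size(const char *path, uint64_t *psize)
{
        int fd;
        off_t end;

        if ((fd = open(path, O_RDONLY | O_NDELAY)) < 0)
                return (-1);

        /* Seeking to the end of a raw device yields its current size. */
        end = lseek(fd, 0, SEEK_END);
        (void) close(fd);

        if (end < 0)
                return (-1);
        *psize = (uint64_t)end;
        return (0);
}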

vdev_disk_open() can tell that a reopen was requested and jumps down to the 'skip_open' label. It's here that the vdev discovers the additional space that the new label exposed. Once the leaf vdev has obtained the updated size it returns, and we start to unwind the call flow depicted above. So let's look at the vdev_open() code to see what happens next:

int
vdev_open(vdev_t *vd)
{
<snip>
        /*
         * If all children are healthy and the asize has increased,
         * then we've experienced dynamic LUN growth.  If automatic
         * expansion is enabled then use the additional space.
         */
        if (vd->vdev_state == VDEV_STATE_HEALTHY && asize > vd->vdev_asize &&
            (vd->vdev_expanding || spa->spa_autoexpand))
                vd->vdev_asize = asize;
<snip>
}

We can see that ZFS will only update the vdev's allocatable size for healthy devices that are either being manually expanded or on pools with the autoexpand property set. But this alone is not enough to create additional metaslabs. It's worth mentioning that the vdev's allocatable size is only relevant to top-level vdevs. Since we expanded only one side of the mirrored top-level vdev, we would not actually see any benefit; in order to grow the pool and create additional metaslabs we would have to run 'zpool online -e' on the other side of the mirror as well. This is true for RAIDZ devices too -- you must grow each of the leaf vdevs before ZFS will make the increased space available. For our mirrored example we can see where this happens:

static int
vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *max_asize,
    uint64_t *ashift)
{
<snip>
        for (int c = 0; c < vd->vdev_children; c++) {
                vdev_t *cvd = vd->vdev_child[c];

                if (cvd->vdev_open_error) {
                        lasterror = cvd->vdev_open_error;
                        numerrors++;
                        continue;
                }

                *asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
                *max_asize = MIN(*max_asize - 1, cvd->vdev_max_asize - 1) + 1;
                *ashift = MAX(*ashift, cvd->vdev_ashift);
        }
<snip>
}
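
One aside on the MIN(x - 1, y - 1) + 1 idiom above: *asize starts out as 0, and because the arithmetic is unsigned, 0 - 1 wraps to UINT64_MAX, so the first child's size always wins without needing a special first-iteration case. A minimal sketch with hypothetical child sizes:

#include <stdint.h>
#include <stdio.h>

#define MIN(a, b)       ((a) < (b) ? (a) : (b))

int
main(void)
{
        uint64_t asize = 0;                     /* initial accumulator */
        uint64_t child[] = { 100, 80, 120 };    /* hypothetical asizes */

        for (int c = 0; c < 3; c++)
                asize = MIN(asize - 1, child[c] - 1) + 1;

        /* Prints 80: the smallest child bounds the mirror's asize. */
        (void) printf("mirror asize = %llu\n", (unsigned long long)asize);
        return (0);
}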

We can see that the asize (allocatable size) for the mirrored device is the minimum allocatable size of all its children. And now that all the leaf vdevs have been expanded we're ready to move on with our example. So let's recap: we relabeled each leaf vdev, each leaf vdev discovered the additional space, and finally the top-level vdev increased its asize. That means that we've unwound the code flow above all the way back to vdev_online():

int
vdev_online(spa_t *spa, uint64_t guid, uint64_t flags, vdev_state_t *newstate)
{
<snip>
        vdev_reopen(tvd);
<snip>
        if ((flags & ZFS_ONLINE_EXPAND) || spa->spa_autoexpand) {

                /* XXX - L2ARC 1.0 does not support expansion */
                if (vd->vdev_aux)
                        return (spa_vdev_state_exit(spa, vd, ENOTSUP));
                spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
        }
<snip>
}

We're now ready to perform the spa_async_request(). It's this asynchronous config update that is responsible for making the newly expanded space available to the pool. So let's look at spa_config_update() to see how this happens:

void
spa_config_update(spa_t *spa, int what)
{
<snip>
                /*
                 * If we have top-level vdevs that were added but have
                 * not yet been prepared for allocation, do that now.
                 * (It's safe now because the config cache is up to date,
                 * so it will be able to translate the new DVAs.)
                 * See comments in spa_vdev_add() for full details.
                 */
                for (c = 0; c < rvd->vdev_children; c++) {
                        vdev_t *tvd = rvd->vdev_child[c];
                        if (tvd->vdev_ms_array == 0)
                                vdev_metaslab_set_size(tvd);
                        vdev_expand(tvd, txg);
                }
<snip>
}

Adam has already talked about vdev_metaslab_set_size(), which is used when we first create the pool. Since we're expanding, we don't go into that function and instead call vdev_expand():

void             
vdev_expand(vdev_t *vd, uint64_t txg)
{               
        ASSERT(vd->vdev_top == vd);
        ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL);

        if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) {
                VERIFY(vdev_metaslab_init(vd, txg) == 0);
                vdev_config_dirty(vd);
        }
}
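
Before moving on, the size check in vdev_expand() deserves a closer look. Here's a minimal user-space sketch of the arithmetic, with hypothetical sizes:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t ms_shift = 32;                 /* 4GB metaslabs */
        uint64_t ms_count = 25;                 /* metaslabs before the grow */
        uint64_t asize = (ms_count << ms_shift) + (2ULL << 30); /* grew 2GB */

        /* The same comparison vdev_expand() makes. */
        if ((asize >> ms_shift) > ms_count)
                (void) printf("new metaslabs will be created\n");
        else
                (void) printf("no new metaslabs yet\n");
        return (0);
}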

There is one subtle thing to note here: in order for ZFS to create new metaslabs, the allocatable space must have grown by at least the size of an existing metaslab. So if the current metaslab size is 4GB and you only expand the disk by 2GB (as in the sketch above), then you won't have increased the disk enough to create a new metaslab. We'll continue with our example under the assumption that we were able to initialize the new metaslabs. I'll leave the deep dive of vdev_metaslab_init() to the reader, but I will mention that this function creates the additional metaslabs without quite yet adding their space to the pool. The space will not become available until the next time we sync out a transaction group (see spa_config_update() to see where this happens).

And that brings us to the final step. When a transaction group syncs we end up calling metaslab_sync_done() on metaslabs that are considered dirty (newly created metaslabs were dirtied when we called vdev_metaslab_init()). It's in this function that we finally make the space accessible to the pool:

void
metaslab_sync_done(metaslab_t *msp, uint64_t txg)
{
<snip>
        /*
         * If this metaslab is just becoming available, initialize its
         * allocmaps and freemaps and add its capacity to the vdev.
         */
        if (freed_map->sm_size == 0) {
                for (int t = 0; t < TXG_SIZE; t++) {
                        space_map_create(&msp->ms_allocmap[t], sm->sm_start,
                            sm->sm_size, sm->sm_shift, sm->sm_lock);
                        space_map_create(&msp->ms_freemap[t], sm->sm_start,
                            sm->sm_size, sm->sm_shift, sm->sm_lock);
                }

                for (int t = 0; t < TXG_DEFER_SIZE; t++)
                        space_map_create(&msp->ms_defermap[t], sm->sm_start,
                            sm->sm_size, sm->sm_shift, sm->sm_lock);

                vdev_space_update(vd, 0, 0, sm->sm_size);
        }
<snip>
}

Here we finally see the space added to the pool (via vdev_space_update()), along with the creation of the space maps that allow allocations and frees to take place on this metaslab. The last remaining piece of the LUN expansion puzzle is 'autoexpand'. I'll leave that as an exercise for the reader, but point those interested to zfsdle_vdev_online().