The Life of a Delphix Intern: Sara Hartse
Written by Sara Hartse (Brown University '17) - I'm Sara, an intern this summer at Delphix on the Platform Team. I'm about to start my senior year studying computer science at Brown. I'm writing this to explore the significance of this internship and what I've learned from being part of Delphix for a summer.
How did I end up here?
During college I've tried to spend my summers both finding new technical areas to explore and working and living somewhere new. In the past I've been at a national lab, at a tiny startup in my hometown and at a massive company in Seattle. Delphix appealed to me because it's a very different type of company and is located somewhere I wanted to explore living.
What did I do?
I joined the Platform Team, the group within Delphix that is responsible for maintaining and developing the underlying operating system and filesystem (illumos and ZFS) that allow the Delphix product to do what it does. I entered the team with some basic systems experience and an attitude that now was as good a time as any to get a sense of how this stuff works. I spent the first weeks learning about what made ZFS special while being surrounded by leading members of the OpenZFS community (even sitting right next to one of its original developers).
The project I picked consisted of building upon a new ZFS feature called channel programs. Channel programs are a ZFS tool motivated by the desire to quickly and safely compose system operations. There are many ZFS operations for manipulating datasets (filesystems, snapshots, clones, etc.) that are effectively system calls but that we want to do repeatedly. For example, deleting all the snapshots of a filesystem would have the following structure:
- get all snapshots
- for each snapshot:
  - userland -> delete call -> syncing context
Significantly, each of these system calls must be executed in syncing context, the period in which ZFS stops accepting user commands and writes out all data to disk. This means that we have to go from open context to syncing context for each deletion. This causes two problems. First, completing a sync usually takes a few seconds, which is a performance hit. Second, since we return to open context between deletions, there is an opportunity for the state of the filesystem to change mid-loop (maybe more snapshots are taken) and cause consistency problems. Essentially, this means we can't do it atomically.
A channel program addresses these issues by pulling all of the necessary logic into a script (written in Lua) that is then executed in its entirety in the kernel. The multiple delete system calls are replaced with a single channel program call, and all the deletions can be executed in a single sync.
- userland -> channel program call -> syncing context:
  - get all snapshots
  - delete each snapshot
With a channel program, these commands can be executed more quickly (only one sync) and with an atomicity guarantee. This talk is a good overview of channel programs.
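To make the contrast between the two call structures concrete, here is a minimal Python sketch. It is only a toy model, not ZFS code: ToyPool, destroy, delete_from_userland and delete_with_channel_program are all hypothetical names invented for illustration, and a "sync" here is just a counter rather than a real write to disk.

```python
class ToyPool:
    """Toy stand-in for a ZFS pool; counts syncs instead of writing to disk."""

    def __init__(self, snapshots):
        self.snapshots = list(snapshots)
        self.sync_count = 0

    def sync(self, operations):
        """Apply a batch of operations in one simulated syncing context."""
        self.sync_count += 1
        for op in operations:
            op(self)


def destroy(name):
    """Return an operation that removes one snapshot when applied."""
    def op(pool):
        pool.snapshots.remove(name)
    return op


def delete_from_userland(pool, between_calls=None):
    """One sync per snapshot; other activity can interleave between calls."""
    for name in list(pool.snapshots):
        pool.sync([destroy(name)])
        if between_calls is not None:
            between_calls(pool)  # simulates someone else changing the pool


def delete_with_channel_program(pool):
    """List and destroy inside a single sync, so nothing can interleave."""
    def program(p):
        for name in list(p.snapshots):
            destroy(name)(p)
    pool.sync([program])


names = ["fs@a", "fs@b", "fs@c"]

# Userland loop: a new snapshot sneaks in after the first deletion.
looped = ToyPool(names)
already_taken = []

def take_snapshot_once(pool):
    if not already_taken:
        pool.snapshots.append("fs@new")
        already_taken.append(True)

delete_from_userland(looped, between_calls=take_snapshot_once)

# Channel-program style: everything happens in one sync.
batched = ToyPool(names)
delete_with_channel_program(batched)

print(looped.sync_count, looped.snapshots)
print(batched.sync_count, batched.snapshots)
```

In this model the userland loop pays for three syncs and still leaves behind the snapshot that appeared mid-loop, while the batched version uses one sync and sees a single consistent list.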
Currently, channel programs support some of the ZFS operations you might want to use. What I've been working on is expanding their capabilities to getting and setting ZFS dataset properties. ZFS datasets have many properties (about 60), everything from creation time and space used to mount location and snapshot limit. The challenge of this project was figuring out where all these properties are stored and reasoning about whether or not they should be accessible from within channel programs. For example, the userquota property can accept a username, which needs to be resolved into a numeric id by contacting an LDAP server. We didn't want the success of a sync to be dependent on such an external process, so I imposed the requirement that the property can only be accessed with the raw numeric id.
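The numeric-id rule can be illustrated with a small Python sketch. This is not the actual ZFS implementation: parse_userquota_property is a hypothetical helper, and it assumes the property is written in the form userquota@<id>. The point is only that the check needs no name lookup, so it cannot depend on an external server.

```python
def parse_userquota_property(prop):
    """Return the numeric id from 'userquota@<id>', or raise ValueError.

    Hypothetical helper: names such as 'userquota@alice' are rejected
    rather than resolved, so no external lookup is ever required.
    """
    prefix = "userquota@"
    if not prop.startswith(prefix):
        raise ValueError(f"not a userquota property: {prop!r}")
    who = prop[len(prefix):]
    # Accept plain ASCII digits only; anything else would need a name lookup.
    if not (who.isascii() and who.isdigit()):
        raise ValueError(f"expected a raw numeric id, got {who!r}")
    return int(who)


print(parse_userquota_property("userquota@1001"))

try:
    parse_userquota_property("userquota@alice")
except ValueError as err:
    print("rejected:", err)
```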
Once getting and setting properties is supported, it will be possible to access or assign multiple properties all at once instead of over multiple syncs. If we can atomically gather ZFS metadata state, this will solve problems where scripts gather an inconsistent view of the filesystem. For example, the replications team has problems with the list of filesystems not being consistent with the list stored in the Delphix metadata database. Channel programs now allow us to atomically create a snapshot of the filesystem containing the metadata database, and get the list of filesystems and their properties.
What did I learn?
This summer I was exposed every day to people working on different and fascinating projects within the Platform Team and around the company. One of my favorite things about Delphix was that the company's size meant I was always interacting with people doing very different things than I was and getting a sense of how the whole company fits together.
On the technical side, I learned a lot about filesystems and operating systems generally. I spent time navigating code paths to see how ZFS datasets are created and destroyed, and all the adventures that can happen to them along the way. I learned to face the unique challenges of tracking down bugs when the program I'm changing is literally the operating system I'm interacting with. I learned how to cause a kernel panic on purpose and how to not panic when I somehow break something to the point that it can't boot. Overall, I had a great time being surrounded by people who wanted to help me learn and were excited about what I was creating.