Data Replication: Building a better NDMP
In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around. The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A binary-only separate daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it.

NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct-attached tape, 3-way restore, etc.). By being simultaneously so specific in its data semantics and so general in its control protocol, it gives us the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn't have the time to replace both the implementation and the data protocol for the current release.

Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:
- Poor error semantics - The strategy was "log everything and worry about it later". For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
- Embedded data semantics - The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn't fly in the Delphix world.
- Unused code - There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
- Standalone daemon - A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.
With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical "great NDMP implementation" waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.

The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing, running a door server to report statistics, or supporting the existing Solaris commands that communicated with the server. This allowed me to add a client-provided callback vector and a set of configuration options to control state. With this library in place, we could use JNA to easily call into C code from our Java management stack without having to worry about marshaling data to and from an external daemon.

The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backup. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the "tape library management" code (which had very little to do with tape library management) that would then call back into common NDMP code, that would then do nothing.

With the data operations gone, I then had to finally address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack.
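The library-plus-callbacks structure described above can be sketched roughly as follows. This is a minimal illustration in C, not the actual libndmp interface: every name here (`sketch_ops_t`, `sketch_server_init`, `sketch_handle_backup`, the `demo_*` consumer) is hypothetical and invented for this example. It assumes the consumer registers a vector of function pointers for starting and stopping a backup and for receiving messages, and that the library routes failures through that vector instead of writing to a private log.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical severity levels; the real library's names may differ. */
typedef enum { SKETCH_LOG_INFO, SKETCH_LOG_ERROR } sketch_level_t;

/* Callback vector supplied by the consumer at registration time. */
typedef struct sketch_ops {
    int  (*so_backup_start)(void *cookie, const char *target); /* 0 on success */
    void (*so_backup_stop)(void *cookie);
    void (*so_log)(void *cookie, sketch_level_t lvl, const char *msg);
} sketch_ops_t;

typedef struct sketch_server {
    sketch_ops_t ss_ops;
    void *ss_cookie;        /* opaque consumer state passed to every callback */
} sketch_server_t;

/* Register the consumer's callbacks; reject an incomplete vector. */
int sketch_server_init(sketch_server_t *srv, const sketch_ops_t *ops, void *cookie)
{
    assert(srv != NULL);
    if (ops == NULL || ops->so_backup_start == NULL ||
        ops->so_backup_stop == NULL || ops->so_log == NULL)
        return -1;
    memset(srv, 0, sizeof (*srv));
    srv->ss_ops = *ops;
    srv->ss_cookie = cookie;
    return 0;
}

/* Format a message and hand it to the consumer rather than a private log. */
static void sketch_report(sketch_server_t *srv, sketch_level_t lvl, const char *fmt, ...)
{
    char buf[256];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof (buf), fmt, ap);
    va_end(ap);
    srv->ss_ops.so_log(srv->ss_cookie, lvl, buf);
}

/* Handle a "start backup" request; the data format is entirely the consumer's. */
int sketch_handle_backup(sketch_server_t *srv, const char *target)
{
    int err = srv->ss_ops.so_backup_start(srv->ss_cookie, target);

    if (err != 0) {
        sketch_report(srv, SKETCH_LOG_ERROR,
            "data operation on '%s' failed with error %d", target, err);
        return -1;
    }
    srv->ss_ops.so_backup_stop(srv->ss_cookie);
    return 0;
}

/* Example consumer: counts calls and remembers the last reported message. */
typedef struct demo_consumer {
    int dc_fail_code;       /* nonzero makes the next backup fail */
    int dc_started;
    int dc_stopped;
    char dc_last_msg[256];
} demo_consumer_t;

static int demo_start(void *cookie, const char *target)
{
    demo_consumer_t *dc = cookie;

    (void) target;
    if (dc->dc_fail_code != 0)
        return dc->dc_fail_code;
    dc->dc_started++;
    return 0;
}

static void demo_stop(void *cookie)
{
    ((demo_consumer_t *)cookie)->dc_stopped++;
}

static void demo_log(void *cookie, sketch_level_t lvl, const char *msg)
{
    demo_consumer_t *dc = cookie;

    (void) lvl;
    snprintf(dc->dc_last_msg, sizeof (dc->dc_last_msg), "%s", msg);
}

static const sketch_ops_t demo_ops = { demo_start, demo_stop, demo_log };
```

A consumer would initialize a `demo_consumer_t`, pass `&demo_ops` and a pointer to that state to `sketch_server_init`, and then see any failure from `sketch_handle_backup` arrive through `demo_log` in its own terms. Passing an opaque cookie to every callback is one common way to let a foreign-function layer such as JNA keep its own state without the library knowing about it.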
Auditing the error paths, along with generic code cleanup, led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). This was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported "NDMP server error CRITICAL: consult server log for details" (of course, there was no way for the customer to get to the "server log"), we would now get much more helpful messages like "NDMP client reported an error during data operation: out of space".

The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA (data management application, the NDMP client role). As I looked at the unbelievably awful 'ndmpcopy' implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).

It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn't include any actual data handling, so it's perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks. The results of this effort will be available on GitHub as soon as we release the next Delphix version (within a few weeks).
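To make the server-or-client idea above more concrete, here is a minimal sketch in C of one request-processing path shared by both roles, with only the table of expected requests differing. It is an assumption about how such a design could look, not the libndmp code: the type and function names (`sketch_session_t`, `sketch_dispatch`, and so on) are hypothetical, and the message codes are made-up values whose names are only loosely modeled on NDMP messages.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_ROLE_SERVER, SKETCH_ROLE_CLIENT } sketch_role_t;

/* Made-up message codes; real NDMP message numbers differ. */
enum {
    MSG_CONNECT_OPEN = 1,
    MSG_DATA_START_BACKUP = 2,
    MSG_NOTIFY_DATA_HALTED = 3,
    MSG_LOG_MESSAGE = 4
};

#define SKETCH_ERR_UNEXPECTED (-1)

typedef struct sketch_session sketch_session_t;
typedef int (*sketch_handler_f)(sketch_session_t *, const void *body);

/* One row of a role's "expected requests" table. */
typedef struct sketch_handler_ent {
    int he_code;
    sketch_handler_f he_func;
} sketch_handler_ent_t;

struct sketch_session {
    sketch_role_t s_role;
    const sketch_handler_ent_t *s_handlers;
    size_t s_nhandlers;
    int s_handled;          /* number of requests accepted so far */
};

/* Stub handlers; a real implementation would decode and act on the body. */
static int handle_connect_open(sketch_session_t *s, const void *b)
{ (void) b; s->s_handled++; return 0; }
static int handle_start_backup(sketch_session_t *s, const void *b)
{ (void) b; s->s_handled++; return 0; }
static int handle_data_halted(sketch_session_t *s, const void *b)
{ (void) b; s->s_handled++; return 0; }
static int handle_log_message(sketch_session_t *s, const void *b)
{ (void) b; s->s_handled++; return 0; }

/* Requests a server expects to receive from a DMA. */
static const sketch_handler_ent_t server_handlers[] = {
    { MSG_CONNECT_OPEN, handle_connect_open },
    { MSG_DATA_START_BACKUP, handle_start_backup }
};

/* Requests a DMA (client) expects to receive from a server. */
static const sketch_handler_ent_t client_handlers[] = {
    { MSG_NOTIFY_DATA_HALTED, handle_data_halted },
    { MSG_LOG_MESSAGE, handle_log_message }
};

/* The role only selects which table the shared dispatch path consults. */
void sketch_session_init(sketch_session_t *s, sketch_role_t role)
{
    assert(s != NULL);
    s->s_role = role;
    s->s_handled = 0;
    if (role == SKETCH_ROLE_SERVER) {
        s->s_handlers = server_handlers;
        s->s_nhandlers = sizeof (server_handlers) / sizeof (server_handlers[0]);
    } else {
        s->s_handlers = client_handlers;
        s->s_nhandlers = sizeof (client_handlers) / sizeof (client_handlers[0]);
    }
}

/* Shared request processing: identical for both roles. */
int sketch_dispatch(sketch_session_t *s, int code, const void *body)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < s->s_nhandlers; i++) {
        if (s->s_handlers[i].he_code == code)
            return s->s_handlers[i].he_func(s, body);
    }
    return SKETCH_ERR_UNEXPECTED;
}
```

With this shape, a session created in the server role accepts `MSG_DATA_START_BACKUP` and rejects `MSG_NOTIFY_DATA_HALTED`, while a session in the client role does the opposite. Keeping the roles as data rather than separate code paths is also what would make it cheap to stand up a simulated remote DMA for testing.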
While interesting, the library is unlikely to be useful to the general masses, and it's certainly not something that we'll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix - it's a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it's a long way from replacing what can be found in illumos. If someone were so inspired, they could build a daemon on top of the current library - one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.

Next up will be a post whose working title is "Data Replication: Metadata + Data = Crazy Pain in My Ass".