Data Replication: Approaching the Problem

With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a "new feature"? The answer is that it's an entirely new implementation, the result of an endeavor to create a more reliable, maintainable, and extensible system. How we got here makes for an interesting tale of business analysis, architecture, and implementation.

Where did we come from?

Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you're in charge of data consistency for disaster recovery, "unreliable" is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture there. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we found ourselves in today.

  • Analysis of the business problem - At the core of the current replication system was the notion that its purpose was disaster recovery. This is indeed a major use case for replication, but it's not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable way to constrain scope, by not identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.

  • Data protocol choice - There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it's tailored to a very specific use case (files and filesystems). By choosing NDMP for replication, we sacrificed features (resumable operations, multiple streams), usability (poor error semantics), and maintainability (unnecessarily complicated operation).

  • Outsourcing of work - At the time this architecture was created, it was decided that NDMP was not part of the company's core competency, and that we should contract with a third party to provide the NDMP solution. I'm a firm believer that engineering work should never be outsourced unless it's known ahead of time that the result will be thrown away. Otherwise, you're inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverables were binary objects - we didn't even have the source available.

  • Architectural design - By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn't control. This made it difficult to adapt to core changes in the underlying abstractions.

  • Algorithmic design - Very early on, the decision was made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities, such as being unable to replicate groups that were not self-contained or that had cyclic dependencies between them. This abstraction was baked so deeply into the architecture that it was impossible to fix within the original design.

  • Implementation - The implementation itself was built to be "isolated" from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state - the target and source could get out of sync such that the only resolution was a new full replication from scratch.

  • Test infrastructure - There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of the functionality without manually setting up multi-machine replication or working with a remote DMA. As a result, only the most basic functionality worked, and even then it was unreliable most of the time.

Ideals for a new system

Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:

  • Separation of mechanism from protocol - Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data. (A minimal sketch of this separation follows this list.)

  • Support for arbitrary topologies - We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.

  • Robust test infrastructure - We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.

  • Integrated with core object model - There should be one place where object definitions are maintained, such that the replication system can't get out of sync with the primary source code.

  • Resilient to failure - No matter what, the system must maintain consistent state in the face of failure. This includes both catastrophic system failure and ongoing changes to the system (e.g. objects being created and deleted while replication is in progress). At any point, we must be able to resume replication from a previously known good state without user intervention. (A checkpoint-based sketch of this idea also follows the list.)

  • Clear error messages - Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.
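
To make the first of these ideals concrete, here is a minimal sketch of what separating mechanism from protocol can look like. The names and interfaces below are hypothetical illustrations, not our actual code: the point is simply that the serializer only ever sees a byte stream, and the transport only ever moves bytes, so either side can be swapped out without touching the other.

    // Hypothetical sketch: "mechanism" (serialization) and "protocol" (transport)
    // meet only at an opaque byte stream.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    /** Mechanism: turns object metadata and data into a stream, and back. */
    interface ReplicationSerializer {
        void serialize(OutputStream out) throws IOException;
        void deserialize(InputStream in) throws IOException;
    }

    /** Protocol: moves an opaque stream between hosts (NDMP, SSH, HTTP, ...). */
    interface ReplicationTransport {
        OutputStream openSendChannel(String targetHost) throws IOException;
        InputStream openReceiveChannel() throws IOException;
    }

    /** Ties the two together; neither interface knows the other exists. */
    class ReplicationJob {
        private final ReplicationSerializer serializer;
        private final ReplicationTransport transport;

        ReplicationJob(ReplicationSerializer serializer, ReplicationTransport transport) {
            this.serializer = serializer;
            this.transport = transport;
        }

        void send(String targetHost) throws IOException {
            try (OutputStream out = transport.openSendChannel(targetHost)) {
                serializer.serialize(out);
            }
        }
    }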
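
Similarly, the "resilient to failure" ideal boils down to a simple contract between source and target: the target durably records the last consistent state it applied, and the source always restarts from that checkpoint. Again, this is only an illustrative sketch with made-up names, not the shipping implementation.

    // Hypothetical sketch of checkpoint-based resume. The target commits a
    // marker only after applying a consistent state; the source streams
    // everything after that marker, or a full stream if no marker exists.
    import java.util.Optional;

    /** Durable marker for the last consistent state applied on the target. */
    final class Checkpoint {
        final String snapshotId;
        Checkpoint(String snapshotId) { this.snapshotId = snapshotId; }
    }

    interface TargetState {
        /** The last checkpoint the target committed, if any. */
        Optional<Checkpoint> lastCommitted();
    }

    interface SourceStreamer {
        /** Send everything after 'from', or a full stream when 'from' is absent. */
        void streamFrom(Optional<Checkpoint> from);
    }

    class ResumableReplication {
        void run(TargetState target, SourceStreamer source) {
            // Safe to re-run after any failure: the committed checkpoint marks
            // a known good state, so no user intervention or full resync is needed.
            source.streamFrom(target.lastCommitted());
        }
    }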

At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI. Next, I'll discuss the first major piece of work: building a better NDMP implementation.