Your Data Transit Authority

The ever-increasing amounts of data that must transit among applications and environments create big challenges for the modern-day enterprise. How do we bring data where it’s needed on time, all the time, the way our best public transit systems do for passengers? Learn how DataOps can transform data management and delivery.

Woody Evans

Mar 21, 2019

In Chicago, they call it the CTA. In New York, it’s the MTA. In San Francisco, it’s BART. Whatever mass transit you’re using, you rely on it to move you from place to place. These transit authorities have a common vision: provide fast, safe, and efficient transportation to as many passengers as possible. DataOps has the same vision for your data.

Trains need a lot of infrastructure. Every time you want to add a station to a line, you’ve got to lay track, build the overhead power lines, put in the station, add railroad crossings, and more. Until you have all that, no passenger gets on a train. It’s the same with data. Moving data to a new destination takes all sorts of configuration first.

Unlike the best passenger trains, our “data trains” are hard to get on and off, they’re slow, the trains don’t come very often, and connectivity between stations is spotty.

Why is that?

Data is heavy. You can be the most agile DevOps shop in the world, but if you need a copy of that 5 TB test dataset, buckle in for a few hours for every test. Imagine if it took 3 hours for every passenger to get on and off the train.

Data is brittle. It’s hard to maintain the state of even one dataset. When you get into many builds and many users, it gets even harder. When you get into many datasets, it’s near impossible. Imagine if, as you added more passengers to the train, the train tended to break down more often — that’s a failed transit system.

Data is disconnected. It’s usually easy to get one dataset, or to get one masked copy of a dataset. But it’s hard to get a continuously updated version. It becomes a herculean effort to distribute an updated version among many consumers of data, including developers, QA, data scientists, and analysts. Not to mention, it’s damn near impossible to create a library of continuously updated datasets available to those consumers. It’s almost as if whenever you added more stations, your capacity to get trains to people went down.

With datasets, part of the issue is that we’re always laying new infrastructure. The first time we add a new dataset to a host, we have to do work to get it there. We repeat a lot of that work when we want to put a new version of that same dataset on that host. Conditioning that data by masking or subsetting it takes even more effort, and that work tends to be custom and very difficult to make repeatable without real investment.

Data is captive. The individual or team who needs data is usually not the one who controls how to get it. Let’s go back to the train analogy. Schedules, infrastructure, and on-time train arrivals are great. But how could you hold down a steady job if the train conductor told you when and which train to get on? It could be the 6 a.m. or it could be next week. Our data is mostly captive to a set of rules created and enforced (albeit with good reason) by people concerned about storage consumption, stuffing their machines, killing their network, and keeping the danger of losing data to a minimum.

Mass Transit for Data

We’ve invested in a mass transit infrastructure for data without receiving as many benefits as we could. What would change if data had a mass transit system that actually worked well?

Easy On, Easy Off

If data professionals spent far less time manually preserving and reproducing the state of a dataset, and could instead reproduce it at the touch of a button, getting on and off the “data train” would be as easy for a dataset as it is for a passenger.

High Frequency

Rather than waiting days, and sometimes weeks, for fresh and secure data, what if developers, testers, and data analysts got their data on a regular and frequent basis, like a train schedule? That would make everything faster.

Testing is often a function of reset time. When reset time is measured in minutes, testing density goes up a lot. If recovering from a data error (and not just a code error) is like waiting for the next Uptown 3 train in NYC, the cost of that error drops dramatically and more errors get caught sooner. Most importantly, when you can switch trains, that is, when you can swap one dataset for another in your library and switch back again, data becomes so fluid that datasets, along with their versions and conditions (like subsets and masking), can be changed at will. That would dramatically shorten the time it takes to get a feature out the door or to gain new insight.

Highly Connected

What if, instead of having to go through a heavy process to get the right dataset for your new build, or the right dataset for your new test suite, or just to swap out datasets to work on an old release, every version of every dataset you care about were at your fingertips in a dataset library and could literally be made ready in minutes? Moreover, what if you could distribute any version to anyone in minutes?

In many cases, the number one source of friction in a company’s feature factory is its data. We have to get to the point where moving our datasets from station to station is as simple as “stop” and “play.” We have to get to the point where moving data around is just no big deal: you don’t need experts, and in fact you can do it yourself through self-service. And we need to standardize the way we manage our dataset library. We should take advantage of the fact that datasets are related to make it easier to track the many versions and (more importantly) share them with ease. Sending the “data train” to a different station on a different line should be like changing tracks, not building the railroad.

Data is Free

If we make our data lightweight, we worry a lot less about the impact it has on storage and network. If we could turn on “autosave” for our datasets just like Google Docs, and we no longer had to worry about being able to restore any specific state or get our work back quickly, a lot of wait time and behaviors around “holding environments” would just stop. Self-service datasets mean you, the consumer, get to pick which train the dataset gets on. It also means you can reliably get to a destination on time, and you don’t need to know in advance or coordinate with anyone else. That sets your data free.

DataOps: The Data Mass Transit System

The operating principle of DataOps is to improve outcomes by bringing together those who need data with those who provide it, thereby eliminating friction throughout the data lifecycle.

As with a poorly operating mass transit system, we have a lot of infrastructure, but it typically takes too long to get data on the train, off the train, and to its destination; our trains break down too often as demand rises; our data terminals aren’t quite ready to handle the volume; and our data spends a lot of time under the control of people who aren’t consuming it.

A DataOps platform provides several key capabilities to address these challenges:

  • By virtualizing data, copies become small and nimble and can be (re)provisioned rapidly; it also means keeping a library of all sorts of copies, versions, and releases is trivial.

  • Using self-service controls, consumers of data get direct control over their datasets.

  • Relying on shared datasets means forking datasets or making copies of copies is trivial; hence, sharing a dataset with someone else becomes a non-issue.

  • Recording change and giving point-in-time provisioning creates data version control.
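To make the capabilities above a little more concrete, here is a minimal Python sketch of how virtualized copies, forking, and point-in-time provisioning might fit together. It is an illustration under stated assumptions, not any particular platform's implementation: the class names `SourceDataset` and `VirtualCopy`, their methods, and the toy key/value "dataset" are all hypothetical. The idea is that the source records each change as a small delta, and every virtual copy stores only its own writes on top of a chosen point in time.

```python
class SourceDataset:
    """Hypothetical source dataset that records every change as a delta."""

    def __init__(self, initial):
        # Version 0 is the full baseline; each later entry holds only changes.
        self._log = [dict(initial)]

    @property
    def latest(self):
        return len(self._log) - 1

    def commit(self, changes):
        """Record a change and return the new version number."""
        self._log.append(dict(changes))
        return self.latest

    def read(self, key, version):
        """Read a key as it looked at the given version (point in time)."""
        for delta in reversed(self._log[: version + 1]):
            if key in delta:
                return delta[key]
        raise KeyError(key)

    def provision(self, version=None):
        """Hand a consumer a virtual copy pinned to a version (default: latest)."""
        return VirtualCopy(self, self.latest if version is None else version)


class VirtualCopy:
    """Hypothetical lightweight copy: shares the source, stores only its own writes."""

    def __init__(self, source, version, overlay=None):
        self._source = source
        self._version = version
        self._overlay = dict(overlay or {})

    def read(self, key):
        if key in self._overlay:
            return self._overlay[key]
        return self._source.read(key, self._version)

    def write(self, key, value):
        # Copy-on-write: the shared source is never modified.
        self._overlay[key] = value

    def fork(self):
        """A copy of a copy: same pinned version, independent overlay."""
        return VirtualCopy(self._source, self._version, self._overlay)

    def private_size(self):
        """How many entries this copy actually stores on its own."""
        return len(self._overlay)


prod = SourceDataset({"alice": 100, "bob": 50})
prod.commit({"bob": 75})            # production moves on to version 1

dev = prod.provision(version=0)     # point-in-time copy of the older state
qa = prod.provision()               # copy of the latest state
dev.write("alice", 0)               # dev's change stays private to dev
qa_fork = qa.fork()                 # sharing is just another cheap copy
qa_fork.write("bob", -1)
```

In this toy model, `dev` still sees `bob` as 50 because it is pinned to version 0, `qa` sees 75, and neither `dev`'s nor `qa_fork`'s writes touch production or each other. Each copy stores only what it changed, which is the sense in which virtual copies are "small and nimble."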

DataOps-based solutions simply provide a smarter approach to data management. By eliminating the friction caused by heavy, brittle, disconnected, and captive datasets, a strong data management platform can move the datasets riding your 8th Avenue local onto high-speed rail. A well-run transit system brings broad and shared access to transportation at scale, and DataOps can do the same thing for your datasets, bringing you massive reductions in time and cost, and significantly more convenience that can really amp up your productivity as a developer or as a company. DataOps lets you empower the passenger and move the data at scale on your timeframe.

Download the “DataOps Lays the Foundations for Agility, Security and Transformational Change” analyst report by 451 Research to learn why your enterprise needs a DataOps strategy, not just more data people.