Why 2020 is the Year You Will Manage Your Data as Code

While developers have the ability to easily build, test, and deploy code with the rise of DevOps, they continue to struggle to manage data with the same speed and simplicity.

Derek Smart

Dec 19, 2019

“If it’s not broke, don’t fix it.” The speed at which people get used to problems is amazing, especially in IT. Humans have developed a wonderful ability to deal with problems so quickly and intuitively that we almost don’t even realize a problem exists. When things do get difficult, workarounds are in our nature.

On a small scale, this works just fine because some problems happen so infrequently or have limited impact. But having this mindset can be counterproductive to automation, where even simple tasks are repeated dozens, hundreds, and thousands of times over the course of a month, a week, or even a single day. Workarounds do not scale. We must be more diligent when identifying problems and not accept workarounds as solutions.

At one point in my career, I spent roughly 50 percent of my time pushing my code out to a server and thought I was doing great, until CI/CD came along. I began to realize I had done it all wrong.

CI/CD relies on an automated and reliable suite of tests. Automating a workaround does not scale, and it does not help teams ship software with quality built in.

It’s only when you codify a process, such as migrating your schema or deploying infrastructure, that the problems inherent in your manual process become glaringly evident.

Seed Data and Mock Data Don’t Come Close to Reality

Similarly, managing data is still a very manual process. Different types of data are needed during application development, depending on where the code is in the pipeline. Early in the development process, when developers are writing code, they want to be able to test their code independently from the database. So they write mock data for their unit tests, and this mock data covers a very specific scenario for testing code.

As the application code moves closer to a production release, each phase of the CI/CD pipeline has some specific set of data required to test and validate the application for that phase. Developers have been addressing this problem with a variety of different data types, including seed data and mock data. These different types of data help validate an application before it ever sees a production release, but they create a number of limitations.

1. Seed data

Using seed data for testing adds complexity to deployments. While seed data allows developers to create the sets of data needed to run the application, it’s also required in production for the application to work properly. A common example of seed data in practice is a database table of state names and abbreviations for different countries. When customers fill out online forms with their home address, they get a drop-down of states to choose from, rather than the application trying to validate free-form text against valid state abbreviations.
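As a rough illustration of that kind of seed data, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, sample rows, and the seed_states and states_for helpers are all assumptions made up for this example, not part of any particular application.

```python
import sqlite3

# Hypothetical seed rows: (country_code, state_code, state_name).
SEED_STATES = [
    ("US", "CA", "California"),
    ("US", "NY", "New York"),
    ("AU", "NSW", "New South Wales"),
]


def seed_states(conn: sqlite3.Connection) -> None:
    """Create the lookup table if needed and insert the seed rows idempotently."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS states (
            country_code TEXT NOT NULL,
            state_code   TEXT NOT NULL,
            state_name   TEXT NOT NULL,
            PRIMARY KEY (country_code, state_code)
        )
        """
    )
    # INSERT OR IGNORE skips rows whose primary key already exists,
    # so rerunning the seed on every deployment does not duplicate data.
    conn.executemany("INSERT OR IGNORE INTO states VALUES (?, ?, ?)", SEED_STATES)
    conn.commit()


def states_for(conn: sqlite3.Connection, country_code: str) -> list:
    """Return the state codes a form drop-down would offer for one country."""
    rows = conn.execute(
        "SELECT state_code FROM states WHERE country_code = ? ORDER BY state_code",
        (country_code,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
seed_states(conn)
seed_states(conn)  # running the seed a second time is safe
us_states = states_for(conn, "US")
print(us_states)
```

Because the same script has to run in every environment, including production, making it idempotent is one way to keep it from becoming yet another manual step.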

In a testing scenario, seed data is commonly used to create a set of data for testing and validating the application before it gets to prod. Now there are multiple sets of seed data that serve different purposes. The burden of managing these sets, understanding their differences, and tracking changes falls on the automation team.

2. Mock data

Mock data, on the other hand, is never intended to be released to production and is generally handwritten by developers to test their code. I’ve written a tremendous number of mocks over the years, and I couldn’t tell you how many mock customers lived at “123 Main Street” in my hometown with my zip code. Why? Because I have to write mocks over and over again, and I don’t like doing it. I write the fastest mock I can to satisfy the test and move on to writing more code.

Mock data is very simple, but it does not represent the variety of data that actual customers generate.

As a result, it was very difficult to create reports and geolocation features for the apps I worked on when everyone lived in the same town and had the same address.

Providing Speed, Automation, and Security at the Data Layer

Moving prod data into lower, non-prod environments is very cumbersome. Because prod data contains sensitive customer information that should not be exposed in those environments, it must be scrubbed, altered, and secured first. In non-production environments, security challenges are magnified as the number of data instances and internal users grows. Companies that adopt data masking techniques as part of a DataOps approach can replace sensitive values with anonymized ones while retaining referential integrity, so the data still supports meaningful software testing and smarter insights.
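One way to picture masking that keeps referential integrity is deterministic, keyed hashing: the same input always maps to the same masked value, so keys still line up across tables. The sketch below uses Python's standard hmac and hashlib modules. The key, the mask helper, and the sample customers and orders are hypothetical, and real masking tools may work quite differently.

```python
import hashlib
import hmac

# Assumption: a real key would come from a secrets manager, not source code.
MASKING_KEY = b"example-key-not-for-real-use"


def mask(value: str, prefix: str) -> str:
    """Deterministically replace a value; equal inputs give equal outputs."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


# Hypothetical production-like rows.
customers = [
    {"customer_id": "C-1001", "email": "ana@example.com"},
    {"customer_id": "C-1002", "email": "raj@example.com"},
]
orders = [
    {"order_id": 1, "customer_id": "C-1001", "total": 42.50},
    {"order_id": 2, "customer_id": "C-1002", "total": 13.00},
    {"order_id": 3, "customer_id": "C-1001", "total": 7.25},
]

masked_customers = [
    {
        "customer_id": mask(c["customer_id"], "cust"),
        "email": mask(c["email"], "user") + "@example.invalid",
    }
    for c in customers
]
# Orders are masked with the same function, so each order still joins
# to the correct (masked) customer.
masked_orders = [{**o, "customer_id": mask(o["customer_id"], "cust")} for o in orders]

known_ids = {c["customer_id"] for c in masked_customers}
print(all(o["customer_id"] in known_ids for o in masked_orders))
```

Keyed hashing on its own is closer to pseudonymization than full anonymization, so treat this as an illustration of the referential-integrity idea rather than a complete masking policy.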

How do we bridge the gap between how we manage our code and how we manage our data during application development? The first step is to recognize that data has state, just as application code has state.

The data used to triage a bug is different from the data used to unit test code. Validating schema changes against handwritten seed data doesn’t accurately represent what will happen in prod.

Each phase of the CI/CD pipeline has a different requirement for data. Some tests run in isolation against one dataset, while others require integration with several different services, all of which need to be in the same state as the application being tested for the results to be valid. If a problem with the code is found during continuous integration, consider sharing the data from that run with the development team to help them fix the issue. Just as release artifacts are moved to production, a consistent data state should move through the pipeline to ensure that each validation step is predictable and repeatable.
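To make "a consistent data state" a little more concrete, one possible approach is to fingerprint a dataset snapshot and have each pipeline phase verify that fingerprint before running its tests, much like checking the hash of a release artifact. This is only a sketch under that assumption: the dataset_fingerprint and check_phase helpers, the phase names, and the sample rows are invented for illustration.

```python
import hashlib
import json


def dataset_fingerprint(rows: list) -> str:
    """Hash a canonical JSON form of the rows, independent of row order."""
    ordered = sorted(rows, key=lambda row: json.dumps(row, sort_keys=True))
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_phase(phase: str, rows: list, expected: str) -> str:
    """Raise if the data a phase sees differs from the pinned snapshot."""
    actual = dataset_fingerprint(rows)
    if actual != expected:
        raise ValueError(f"{phase}: dataset drifted from the pinned state")
    return actual


# Hypothetical snapshot pinned alongside a build.
snapshot = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}]
pinned = dataset_fingerprint(snapshot)

# The same rows in a different order still match the pinned state.
check_phase("integration", list(reversed(snapshot)), pinned)

# An extra row means this phase is no longer testing against the same data.
drifted = snapshot + [{"id": 3, "state": "TX"}]
try:
    check_phase("staging", drifted, pinned)
    drift_detected = False
except ValueError as err:
    drift_detected = True
    print(err)
```

When a check like this fails, the fingerprint also gives the development team an exact reference to the data the failing phase ran against.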

With self-service data controls, developers can access data at any point in time for the environment they need, without putting in requests to database administrators. For organizations to speed up software development, data processes have to be quick, painless, and most importantly secure. The bottom line? The more accurate the data and the earlier it is available in CI/CD, the sooner data-related defects can be found.

Manage the Complexity of Data with DataOps

During application development, developers need accurate and portable datasets so they can focus on writing high-quality code instead of writing mocks. The same datasets should be portable and available to different pipeline phases, so tests run quickly and accurately.

In short, quality depends on testing with production-like data. Dev teams should be able to manage the cadence at which their datasets are updated and receive those updates without manual intervention through self-service controls. Reducing the friction between developers and those who manage and secure data is what DataOps is focused on.

A DataOps platform addresses both aspects of the data challenge: speed and security. It delivers data from production to downstream environments and maintains the integrity of the datasets. Automation with APIs enables fast updates and eliminates manual scripts.

Developers will achieve true agility in their application workflows only when data is managed like code.

The most well-intentioned development projects can create a modern digital tragedy if shortcuts are taken when delivering data to developers or if data security practices are not carried out properly.