Application Development

5 Ways to Optimize Data for Software Development

Why a successful DevOps strategy hinges on data.

Lenore Adam

Jan 10, 2022


Automated CI/CD toolchains and cloud programmability give DevOps teams game-changing flexibility in application development. With this unprecedented agility, they have established efficient, repeatable software delivery processes to streamline work in progress and improve productivity.

But while app dev teams have automated builds and deployments in CI/CD pipelines, getting test data into this automated toolchain still relies on clunky ticketing processes and manual intervention. Data places a tremendous drag on release velocity, yet it is surprisingly overlooked when evaluating the speed of the value stream.

Automated code and infrastructure help accelerate delivery, but an efficient release pipeline requires automated infrastructure, code, and data.

The speed of the software delivery lifecycle depends on how quickly you can provision all three components. So what specifically do we mean by automating data? We’re talking about automating legacy processes for provisioning, refreshing, and securing test data across the entire software delivery lifecycle.

Here are five ways to optimize data operations for software development with code examples using the Delphix dxtoolkit and dxm-toolkit.

1. Keep Pace With Ephemeral Test Environments

To achieve faster development cycles, whether in two-week sprints or in a continuous integration model, you’re going to be breaking code more often and in rapid succession. That leads to an almost continuous testing cadence, which in turn increases demand for test beds.

With an automated cloud-based infrastructure, test beds are easily spun up and down to match the non-stop cadence of code changes. One of our customers told us that the average life of their ephemeral test environments is just 21 minutes!

Data operations have to be automated for the data pipeline to keep pace with the speed and volume of test environment creation. Delphix virtual databases can easily be provisioned into and removed from test environments:

./dx_provision_db -type oracle -sourcename PROD_CRM -targetname TEST_CRM -timestamp LATEST
./dx_remove_db -name TEST_CRM
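As a sketch of how this might look inside a pipeline job, the wrapper below provisions a short-lived virtual database for a single test run and guarantees teardown when the job exits. The TEST_DB naming convention, the CI_JOB_ID variable, and the run_tests.sh script are hypothetical placeholders, not part of the dxtoolkit:

#!/bin/bash
# Hypothetical CI wrapper: one ephemeral VDB per pipeline run.
set -euo pipefail

TEST_DB="TEST_CRM_${CI_JOB_ID:-local}"  # unique name per run (assumed convention)

cleanup() {
  # Remove the virtual database when the job exits, pass or fail
  ./dx_remove_db -name "$TEST_DB"
}
trap cleanup EXIT

# Provision a fresh copy of the latest production data for this run
./dx_provision_db -type oracle -sourcename PROD_CRM -targetname "$TEST_DB" -timestamp LATEST

# Run the test suite against the ephemeral database (placeholder)
./run_tests.sh --db "$TEST_DB"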

2. Establish Data Agility in CI/CD Pipelines

Waiting even a few minutes for test data breaks the efficiency of an automated CI/CD pipeline, but the reality is that wait states for test data can be weeks or even months. And when destructive testing requires rewinding data to a previous state, the clock starts ticking again.

Using nimble virtual databases provisioned by Delphix, DevOps teams can take those outdated data operations and automate them with just a few lines of code.

For example, restoring a virtual database after destructive testing is a simple command executed without the need for DBA resources or IT tickets:

./dx_rewind_db -name TEST_CRM -timestamp '2021-10-14 00:41'

When DevOps teams codify data, they begin to think about their pipelines in completely different ways, such as testing multiple code changes across identical testbeds, or turning traditional serial testing into highly efficient parallel testing.
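As a sketch of the parallel-testing idea, assuming the same dx_provision_db flags shown earlier and hypothetical branch names, several identical testbeds can be provisioned from the same point in time, one per code change under test:

#!/bin/bash
# Sketch: identical testbeds for parallel testing (branch names are hypothetical)
SNAP_TIME='2021-10-14 00:41'  # a single fixed point in time for every copy

for BRANCH in feature-a feature-b hotfix-123; do
  ./dx_provision_db -type oracle -sourcename PROD_CRM \
    -targetname "TEST_CRM_${BRANCH}" -timestamp "$SNAP_TIME" &
done
wait  # every testbed now holds the same snapshot, ready for parallel runs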

3. Create Faster, Higher Quality Feedback Loops

Leveraging virtual databases and automating data operations greatly improves the efficiency of the pipeline. But we also need to consider the quality of the data being delivered. In the DevOps realm, we often talk about shift-left testing: testing early and often to facilitate rapid feedback loops. An important element, though, is to use test environments that are as close to production as possible to ensure comprehensive test coverage. That includes test data, which should also reflect the full production instance.

Maintaining a continuous flow of production-quality data to test environments is essential at every test stage, especially when databases have a high rate of change. Without rapid delivery of quality test data, teams will use stale data or resort to subsets just to keep the release train moving. They risk missing edge cases or complex data scenarios, sending issues further downstream where they are harder to triage and more expensive to fix.

There is a significant amount of waste and rework caused by the right data not being available to developers and testers when they need it. Shifting left with production-quality data both improves software quality and increases developer productivity. Refreshing a Delphix virtual database with high-fidelity data is straightforward:

./dx_refresh_db -name TEST_CRM
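In a pipeline, the refresh can gate each test stage. A minimal sketch, assuming a hypothetical run_regression_suite.sh test runner:

# Refresh the testbed before the stage; fail fast rather than test stale data
if ./dx_refresh_db -name TEST_CRM; then
  ./run_regression_suite.sh --db TEST_CRM  # hypothetical test runner
else
  echo "Refresh failed; aborting stage to avoid stale test data" >&2
  exit 1
fi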

4. Ensure Data Security is Not a Barrier to Innovation

Another area where data needs to keep pace with DevOps flexibility and speed is in securing the sensitive data that resides in all of these non-production environments. These environments present an enormous attack surface for bad actors inside the organization or for hackers infiltrating the IT system.

Unfortunately, sensitive information in non-prod environments goes largely unprotected. Organizations often find the process of anonymizing or masking sensitive data to be at odds with the speed of DevOps workflows, and so it is viewed as a barrier to innovation.

We also see many companies attempt to use a homegrown masking solution, where they manually go through hundreds, if not thousands, of tables to discover sensitive columns, then use brittle scripts to obfuscate the data. We describe this as a cracked dam, meaning these poorly executed processes leave organizations at high risk of inference attacks and sensitive data leakage.

Adding to the challenge is a growing need for more data-ready environments for business analytics or machine learning modeling. Outsourcing development or using third parties for additional data processing expands the risk even further.

Without a comprehensive masking solution, businesses are effectively distributing more and more sensitive data. Manual processes aren’t a sustainable way to protect all the sensitive data in lower-level environments, where data is copied over and over.

There's a crucial need for compliance automation so security is no longer a hindrance to innovation. With Delphix, it’s simple to automate discovery and anonymization of sensitive data. We also provide regulation-specific algorithms to support compliance efforts. Here’s an example of how to automate CCPA compliance with Delphix:

./dxmc profilejob start --jobname PII_SCAN
./dxmc job start --jobname CCPA_MASKING
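A minimal sketch of how these two jobs might be chained in a pipeline, assuming each command blocks until its job completes (behavior can vary by dxm-toolkit version):

# Discover sensitive columns first, then mask; stop if either step fails
./dxmc profilejob start --jobname PII_SCAN || { echo "profiling failed" >&2; exit 1; }
./dxmc job start --jobname CCPA_MASKING   || { echo "masking failed" >&2; exit 1; }
# Only after both jobs succeed is the dataset cleared for delivery
# to lower-level environments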

5. Synchronize Diverse Datasets for Complex Test Environments

DevOps teams drive innovation across a highly interconnected web of applications and services that require sophisticated systems integration testing before deploying into production.

The data sources that make up the enterprise landscape are continually growing in complexity for the following reasons:

  • Applications and the underlying databases are housed in many locations, on-prem and across clouds.

  • There are different infrastructure models to consider when dealing with data in the cloud. Access to data is quite different depending on whether the data resides in IaaS, PaaS databases, or SaaS applications.

  • There’s also increasing diversity of data sources as data architects choose more fit-for-purpose databases over a monolithic data source.

That means prod data and non-prod environments are not necessarily in the same location, and delivering data for integrated test environments can be really challenging.

This chart helps illustrate the diversity of data sources in the enterprise. Despite the trend toward cloud adoption, over half of respondents said that either their legacy apps and databases aren’t supported in the cloud, or they are simply too large and complex to migrate. So a hybrid model of on-prem and cloud-based applications and databases will persist.

[Chart: barriers to migrating legacy applications and databases to the cloud. Source: Pulse Survey, 2020]

The next chart shows the specific trends for data sources that have moved to the cloud: increasing use of PaaS databases, open source technologies, and fit-for-purpose data sources.

[Chart: trends in cloud-hosted data sources, including PaaS databases and open source technologies. Source: Pulse Survey, 2020]

All this adds to the complexity of creating quality integration testing environments for complex enterprise application stacks. Testing transactions and functionality that traverse these interconnected systems requires coordination and synchronization of this diverse set of data environments.

Configuring adequate integration testing environments is often one of the hardest problems to solve for DevOps teams developing complex apps and business workflows. Virtual databases change the physics of data, making data lightweight and mobile. That means heterogeneous sources can be easily transported when production and non-production environments are not co-located.

The various virtual databases can be synced to the same point in time so dev and test teams can time travel them as a single unit. Here’s an example of how to create a data group for unified time travel.

First, we create a container that groups a TEST_CRM database with a TEST_BILLINGS database into GROUPED_DBS:

dx_ctl_js_container -action create -container_def TEST_CRM -container_def TEST_BILLINGS -container_name GROUPED_DBS -template_name combined_apps -container_owner USER-1

Then we can operate on the container:

dx_ctl_js_container -container_name GROUPED_DBS -action refresh
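A minimal sketch of how this fits into an integration test stage, assuming a hypothetical run_integration_suite.sh runner: refresh the group as one unit, then test against the synchronized pair.

# Refresh both databases to the same point in time, then run integration tests
dx_ctl_js_container -container_name GROUPED_DBS -action refresh \
  && ./run_integration_suite.sh --crm TEST_CRM --billing TEST_BILLINGS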

The benefits of data automation for DevOps are twofold: faster delivery and higher-quality data available throughout the CI/CD pipeline, resulting in dramatic gains in efficiency without sacrificing security.

Watch this webinar to learn why API-driven data delivery is the secret weapon to optimize CI/CD pipelines.