What is Test Data Management?

Test Data Management (TDM) is the process of providing controlled access to test data for development and testing teams throughout the Software Development Lifecycle (SDLC).

Why is TDM important?

Modern Test Data Management solutions help organizations improve application development speed, code quality, data compliance, and sustainability by providing timely access to fresh, relevant data downstream for code development, automated testing, troubleshooting, and validation.

What can Test Data Management tools do?

Test Data Management involves synchronizing multiple data sources from production, versioning copies, discovering sensitive data, masking data for compliance, and distributing test data across multiple clouds to support agile development and automated testing.

Managing Sensitive Data

A test data management tool helps CIO and CISO teams administer security controls such as data masking, authorization, authentication, fine-grained data access management, and audit logs in downstream environments as part of test data management processes. This helps organizations quickly meet compliance and data privacy regulations when provisioning test data, while also reducing data friction for AppDev and software test teams.

State of Test Data Management Tools

Test Data Needed

Modern DevOps teams need high quality test data based on real production data sources for software testing early in the SDLC. This helps development teams bring high-quality applications to market at an increasingly competitive pace.

Data for DevOps

Though many organizations have adopted agile software development and DevOps methodologies, there has been an underinvestment in test data management tools—which has constrained innovation.

Accelerate DevOps Initiatives with Test Data Management

Modern DevOps teams are focused on improving system availability, reducing time-to-market, and lowering costs. Test data management helps organizations accelerate strategic initiatives such as DevOps and cloud by greatly improving compliant data access across the SDLC, which in turn improves software development speed, code quality, data compliance, and sustainability.

Common Test Data Challenges

Application development teams need fast, reliable test data but are constrained by the speed, quality, security, and costs of moving data to environments during the software development lifecycle (SDLC). Below are the most common challenges that organizations face when it comes to managing test data.

Test environment provisioning is a slow, manual, and high-touch process

Most IT organizations rely on a request-fulfill model, in which developers and testers find their requests queued behind others. Because it takes significant time and effort to create test data, it can take days or even weeks to provision updated data for an environment.

Often, the time to turn around a new environment is directly correlated with how many people are involved in the process. Enterprises typically have four or more administrators involved in setting up and provisioning data for a non-production environment. Not only does this process place a strain on operations teams, but it also creates time sinks during test cycles, slowing the pace of application delivery.

Development teams lack high-fidelity data

Development teams often lack access to test data that is fit for purpose. For example, depending on the release version being tested, a developer might require a data set as of a specific point in time. But all too often, they are forced to work with a stale copy of data due to the complexity of refreshing an environment. This can result in lost productivity from time spent resolving data-related issues and increase the risk of data-related defects escaping into production.

Data masking adds friction to release cycles

For many applications, such as those processing credit card numbers, patient records, or other sensitive information, data masking is critical to ensuring regulatory compliance and protecting against data breaches. According to the Ponemon Institute, the cost of a data breach, including the costs of remediation, customer churn, and other losses, averages $3.92 million. However, masking sensitive data often adds operational overhead; an end-to-end masking process may take an entire week because of the complexity of managing referential integrity across multiple tables and databases.

Storage costs are continually on the rise

IT organizations create multiple, redundant copies of test data, resulting in inefficient use of storage. To meet concurrent demands within the confines of storage capacity, operations teams must coordinate test data availability across multiple teams, applications, and release versions. As a result, development teams often contend for limited, shared environments, resulting in the serialization of critical application projects.

Common Types of Test Data

There are four common ways to create test data for application development teams and testing teams in the SDLC.

Production Data


Real data from production environments provides the most complete test coverage, but without modern DevOps TDM tooling it can add friction because of the security controls around sensitive data.

Data Subsets

Test data subsets can improve static test performance while providing some savings on compute, storage, and software licensing costs. However, subsets do not provide sufficient test coverage for system integration testing. Subsets intrinsically omit test cases, and because they are still direct copies of production values, they continue to contain sensitive data, as the sketch below illustrates.
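
The following is a minimal sketch of referentially consistent subsetting, assuming a toy schema of customers and orders tables in SQLite; the schema, table names, and sampling rule are invented for the example, not tied to any particular TDM product. Note that every value copied into the subset is still a raw production value.

    # A sketch of referentially consistent subsetting (toy schema, SQLite).
    import sqlite3

    def subset(source_db: str, target_db: str, sample_size: int = 1000) -> None:
        src = sqlite3.connect(source_db)
        dst = sqlite3.connect(target_db)
        dst.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        dst.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
                    "customer_id INTEGER REFERENCES customers(id), total REAL)")

        # Pick a driving subset of parent rows first.
        customers = src.execute(
            "SELECT id, name FROM customers ORDER BY id LIMIT ?", (sample_size,)
        ).fetchall()
        dst.executemany("INSERT INTO customers VALUES (?, ?)", customers)

        # Copy only the child rows that reference the sampled parents, so
        # every foreign key in the subset still resolves. The values are
        # unchanged production values: the subset is smaller, not safer.
        ids = [row[0] for row in customers]
        marks = ",".join("?" * len(ids))
        orders = src.execute(
            f"SELECT id, customer_id, total FROM orders "
            f"WHERE customer_id IN ({marks})", ids
        ).fetchall()
        dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        dst.commit()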

Masked Data

[Image: Masked Data Capabilities with Delphix]

Production data obfuscation using masking techniques helps teams leverage existing data in a compliant manner to quickly provision test data that meets regulatory requirements such as PCI, HIPAA, and GDPR.

Masking takes the data from production, uses algorithms to identify sensitive fields, and obfuscates PII and other sensitive values while keeping the data that is relevant for testing. This enables test data provisioning with realistic values without introducing unsafe levels of risk.
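
Below is a minimal sketch of deterministic masking, assuming the sensitive fields have already been identified by a discovery step; the field names, secret key, and keyed-hash scheme are illustrative, not any specific vendor's algorithm. Because the same input always yields the same fictitious token, masked values stay consistent across tables and databases, preserving referential integrity.

    # A sketch of deterministic masking with a keyed hash.
    import hashlib
    import hmac

    SECRET_KEY = b"rotate-me"  # illustrative; keep real keys out of non-production

    def mask_value(value: str, prefix: str) -> str:
        """Replace a sensitive value with a fictitious but stable token."""
        # Keyed hashing is deterministic: the same input always produces the
        # same token, so a customer masks identically in every table and
        # database, which preserves joins and referential integrity.
        digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
        return f"{prefix}_{digest[:8]}"

    row = {"name": "Ada Lovelace", "email": "ada@example.com", "total": 42.50}
    masked = {
        "name": mask_value(row["name"], "name"),
        "email": mask_value(row["email"], "user") + "@example.com",
        "total": row["total"],  # non-sensitive fields pass through unchanged
    }
    print(masked)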

Synthetic Data Generation

Synthetic data intrinsically contains no personally identifiable or otherwise sensitive information. This makes synthetic data generation an appealing choice for initial prototyping of new features or for early exploration of test data sets.

Synthetic data generation typically involves mathematically computing values, or selecting items from lists, using algorithms that match a target statistical distribution.
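
As a concrete illustration, here is a minimal rule-based generator in Python; the field names, name list, plan weights, and log-normal model of order totals are all invented for the example.

    # A sketch of rule-based synthetic data generation.
    import random

    FIRST_NAMES = ["Alex", "Sam", "Jordan", "Riley", "Casey"]  # illustrative
    PLAN_WEIGHTS = {"free": 0.6, "pro": 0.3, "enterprise": 0.1}  # assumed mix

    def synthetic_customer(rng: random.Random) -> dict:
        return {
            # Select list items to match a target categorical distribution.
            "name": rng.choice(FIRST_NAMES),
            "plan": rng.choices(list(PLAN_WEIGHTS), list(PLAN_WEIGHTS.values()))[0],
            # Mathematically compute values: log-normal is a common shape
            # for spend-like fields.
            "order_total": round(rng.lognormvariate(3.0, 1.0), 2),
        }

    rng = random.Random(42)  # seeded, so test fixtures are reproducible
    customers = [synthetic_customer(rng) for _ in range(100)]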

While synthetic data can help with initial unit tests, it cannot replace the complete data sets that are needed throughout the testing process. Realistic data from production contains valuable test cases that are necessary to validate applications early and often, shifting issues left in the SDLC.

Best Practices for Test Data Management

A comprehensive approach should seek to improve test data management in the following areas:

  • Data delivery: reducing the time to deliver test data to a development team or test team

  • Data quality: meeting requirements for high-fidelity test data

  • Data security: minimizing security risks without compromising speed

  • Infrastructure costs: lowering the costs of storing and archiving test data

Data Delivery

Creating copies of real data from production environments for development or testing is typically a time-consuming, labor-intensive process that lags demand. Modern organizations need streamlined, repeatable processes for fast data delivery featuring:

  • Automation: Modern DevOps toolchains typically include technologies to automate build processes, infrastructure delivery, and testing. However, organizations often lack equivalent tools for delivering test data at the same level of automation. A streamlined test data management approach eliminates manual processes, such as target database initialization, configuration steps, and validation checks, providing a low-touch approach for new ephemeral data environments.

  • Toolset integration: A modern test data management approach should unify technologies for data versioning, data masking, data subsetting, and synthetic data creation. This requires that tools have open APIs or direct integrations to fully enable automated declarative workflows for both infrastructure and data.

  • Self-service: Instead of relying on IT ticketing systems, a modern test data management approach leverages automation to enable users to provision test data on-demand. Self-service capabilities should extend not only to test data delivery, but also to versioning, bookmarking, and sharing. Individuals should be their own test data managers, leveraging features such as bookmark, refresh, rewind, archive, and share without waiting on Data Administrators or involving IT Operations teams; a sketch of this workflow follows this list.
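
A minimal sketch of what self-service provisioning can look like from a CI job, assuming a hypothetical TDM REST API: the https://tdm.example.com endpoint, the /datasets/.../provision route, and the payload fields are all invented for illustration and do not correspond to any real product's API.

    # A sketch of self-service test data provisioning from a pipeline.
    import requests

    TDM_API = "https://tdm.example.com/api/v1"  # hypothetical endpoint

    def provision(dataset: str, bookmark: str, token: str) -> str:
        """Ask the TDM service for a fresh virtual copy and return its URL."""
        resp = requests.post(
            f"{TDM_API}/datasets/{dataset}/provision",    # assumed route
            json={"bookmark": bookmark, "ttl_hours": 8},  # ephemeral copy
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["connection_url"]

    # A CI test job provisions its own masked copy instead of filing a ticket.
    db_url = provision("orders_masked", "release-2.4-baseline", token="...")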

Data Quality

When IT Operations teams are creating test data—such as masked production data or synthetic datasets—they must balance requirements on three key dimensions:

Test Data Age

Due to the time and effort required to prepare test data, operations teams are often unable to fulfill ticketed demand. As a result, data often becomes stale in non-production, which can impact test quality and result in costly, late-stage errors. A TDM approach should aim to reduce the time it takes to refresh an environment, making the latest test data more accessible.

Test Data Accuracy

A TDM process can become challenging when multiple datasets are required as of a specific point in time for systems integration testing. For instance, testing a procure-to-pay process might require data federated across customer relationship management, inventory management, and financial applications. A TDM approach should allow multiple datasets to be provisioned to the same point in time and reset together between test cycles.
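
To make this concrete, here is a minimal sketch that provisions several datasets to one shared timestamp, reusing the same hypothetical TDM API from the earlier example; the route and the point_in_time field are assumptions, not a real API.

    # A sketch of federated point-in-time provisioning.
    from datetime import datetime, timezone
    import requests

    TDM_API = "https://tdm.example.com/api/v1"  # hypothetical endpoint
    BASELINE = datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc).isoformat()

    for dataset in ("crm", "inventory", "financials"):
        # One shared timestamp for every source, so cross-system data agrees;
        # a reset between test cycles re-runs this loop against the baseline.
        requests.post(
            f"{TDM_API}/datasets/{dataset}/provision",  # assumed route
            json={"point_in_time": BASELINE},
            timeout=30,
        ).raise_for_status()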

Test Data Size

In the interest of reducing storage footprints, developers may consider using data subsets in an attempt to improve agility. However, subsets can't satisfy all functional testing requirements; missing test cases shift issues right in the SDLC and increase overall project costs.

A modern TDM solution should look to reduce the number of unmonitored copies of test data across environments, enable the sharing of common data blocks across similar copies (saving on storage), and reduce manual processes with increased workflow automation to save on operating costs.

Data Security

[Image: Masked Data and Data Security with Delphix]

Masking tools have emerged as an effective and reliable method of protecting actual data from production. By irreversibly replacing sensitive data fields with fictitious yet realistic values, masking supports regulatory compliance and neutralizes the risk of a data breach in test environments. To make masking practical and effective, organizations should consider the following requirements:

Complete solution 

Many organizations fail to adequately mask test data because they lack a complete solution with out-of-the-box functionality to discover sensitive data and then audit the trail of masked data. In addition, an effective approach should mask testing data consistently while maintaining referential integrity across multiple, heterogeneous sources.
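
As one piece of such a solution, sensitive data discovery can be as simple as profiling sampled column values against known formats. The sketch below shows a pattern-based profiler in Python; the regexes and the 80% match threshold are illustrative starting points, not a complete discovery ruleset.

    # A sketch of pattern-based sensitive data discovery.
    import re

    PATTERNS = {  # illustrative starting points, not a complete ruleset
        "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
        "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
        "credit_card": re.compile(r"^\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"),
    }

    def discover(sample: list[str], threshold: float = 0.8) -> list[str]:
        """Flag the classifications whose pattern matches most sampled values."""
        hits = []
        for label, pattern in PATTERNS.items():
            matches = sum(1 for value in sample if pattern.match(value))
            if sample and matches / len(sample) >= threshold:
                hits.append(label)
        return hits

    print(discover(["ada@example.com", "sam@example.org", "n/a"], threshold=0.6))
    # -> ['email']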

No need for development expertise 

Organizations should look for lightweight masking tools that can be set up without scripting or specialized development expertise. Tools with fast, predefined masking algorithms, for example, can dramatically reduce the complexity and resource requirements that stand in the way of consistently applying masking.

Integrated masking and distribution

Only about one in four organizations uses masking tools, in large part because of the challenges of delivering masked data downstream. To overcome this, masking processes should be tightly coupled with data delivery.

Organizations will benefit from an approach that allows them to mask data in a secure zone and then easily distribute compliant data to non-production environments, including those in offsite data centers or public clouds.

Infrastructure Costs

With the rapid proliferation of test data, TDM teams must build a toolset that maximizes the efficient use of infrastructure resources. Specifically, a TDM toolset should meet the following criteria:

Data consolidation

It is common for organizations to maintain non-production environments where 90% of the data is redundant. A TDM approach should aim to consolidate storage and slash costs by sharing common data across environments—including those used not only for testing, but also development, reporting, production support, and other use cases.
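
A common mechanism behind this consolidation is block-level sharing: each environment references data blocks by content hash, so identical blocks are stored once no matter how many copies point to them. The sketch below illustrates the idea; the 4 KB block size and in-memory store are simplifications for illustration.

    # A sketch of content-addressed block sharing across environments.
    import hashlib

    BLOCK_SIZE = 4096
    store: dict[str, bytes] = {}  # shared block store keyed by content hash

    def snapshot(data: bytes) -> list[str]:
        """Record an environment's data as a list of block references,
        storing each unique block once no matter how many snapshots use it."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # deduplicate on write
            refs.append(digest)
        return refs

    baseline = b"shared production baseline ... " * 1000
    env_a = snapshot(baseline)                       # 8 block references
    env_b = snapshot(baseline + b"one changed row")  # shares 7 of those blocks
    print(len(env_a) + len(env_b), "references,", len(store), "stored blocks")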

Data archiving

A TDM approach should make it feasible to maintain libraries of test data by optimizing storage use and enabling fast retrieval. Data libraries should be automatically version-controlled, much as tools like Git version-control code.
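
In practice, if snapshots themselves are cheap (for example, via copy-on-write storage), versioning test data reduces to tracking metadata. This minimal sketch records named bookmarks with parent pointers so lineage can be walked like a commit history; the JSON registry layout is an assumption made for the example.

    # A sketch of a version-controlled bookmark registry for test data.
    import json
    import time
    import uuid
    from pathlib import Path

    REGISTRY = Path("bookmarks.json")  # illustrative metadata store

    def bookmark(dataset: str, tag: str, parent: str | None = None) -> str:
        """Record a named, immutable reference to a dataset snapshot, with a
        parent pointer so lineage can be walked like a commit history."""
        entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
        ref = uuid.uuid4().hex[:12]
        entries[ref] = {
            "dataset": dataset,
            "tag": tag,
            "parent": parent,
            "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        REGISTRY.write_text(json.dumps(entries, indent=2))
        return ref

    base = bookmark("orders_masked", "release-2.4-baseline")
    repro = bookmark("orders_masked", "defect-repro", parent=base)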

Contention Reduction

At most IT organizations, data access is serialized due to contention in shared software testing environments during working hours. Paradoxically, environments are often underutilized across the entire testing process: systems are left running while idle because populating a new environment with configurations and test data takes so long. A modern TDM approach should enable the ephemeral use of instantly accessible data from any point in time.

Ephemeral Data Environments

Users should be able to bookmark data, tear down infrastructure environments, and redeploy a new data environment populated from a bookmark in minutes using their test data management tools. This eliminates shared resource contention during peak times, leverages automation to free up resources during off-peak times, and enables individual data sandbox environments to run in parallel.
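
Here is a minimal sketch of that daily cycle against the same hypothetical TDM API used in the earlier examples (the /environments routes and response fields are invented for illustration): bookmark state at the end of the day, free the infrastructure, and redeploy from the bookmark when testing resumes.

    # A sketch of an ephemeral-environment cycle.
    import requests

    TDM_API = "https://tdm.example.com/api/v1"  # hypothetical endpoint

    def end_of_day(env: str) -> str:
        """Bookmark the environment's data, then free its infrastructure."""
        ref = requests.post(
            f"{TDM_API}/environments/{env}/bookmark", timeout=30  # assumed route
        ).json()["ref"]
        requests.delete(f"{TDM_API}/environments/{env}", timeout=30)
        return ref

    def start_of_day(ref: str) -> str:
        """Redeploy a fresh environment populated from the saved bookmark."""
        resp = requests.post(
            f"{TDM_API}/environments",                            # assumed route
            json={"from_bookmark": ref}, timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["connection_url"]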

An optimized TDM strategy can eliminate contention while achieving up to 50% higher utilization of resources.

The Modern Approach to Test Data Management

[Image: Test Data Management Production and Non-Production Solution]

By using a modern DevOps TDM approach, organizations can transform how teams manage and consume appropriate test data. IT operations can mask and deliver data one hundred times faster while using ten times less space. The net result? More projects can be completed in less time using less infrastructure.

  • Faster release cycles and time-to-market: environment refreshes drop from 3.5 days to 10 minutes via self-service

  • Higher quality releases and reduced cost: data-related defects cut from 15% to 0%

  • Ensured data privacy and regulatory compliance: data secured in non-production

Explore how Delphix can help you test faster with greater confidence.