What is Test Data Management?
Test data management (TDM) refers to the function that creates, manages, and delivers test data to application teams. Historically, application teams have manufactured data for development and testing in a siloed, unstructured fashion.
Current State of Test Data Management
In today’s digital era, every company must bring high-quality applications to market at an increasingly competitive pace. While companies have adopted agile and DevOps methodologies in pursuit of this goal, many have significantly underinvested in test data—which has emerged as a constraint in the race to innovate.
The TDM market has shifted to a new set of strategies, largely driven by an increased focus on application uptime, faster time-to-market, and lower costs. TDM is rapidly maturing alongside other IT initiatives such as DevOps and cloud.
Once viewed as a back-office function, test data management (TDM) is now a critical business enabler for enterprise agility, security, and cost efficiency. As the volume of application projects increases, many large IT organizations are recognizing the opportunity to gain economies of scale by consolidating TDM functions into a single group or department. Doing so lets them take advantage of innovative tools to create test data and operate much more efficiently than siloed, decentralized, and unstructured TDM teams.
As centralization has begun to yield large efficiency gains, the scope of TDM has expanded to include subsetting, synthetic data generation, and, most recently, masking of production data.
Common Test Data Challenges
Application development teams need fast, reliable test data for their projects, but many are constrained by the speed, quality, security, and costs of moving data across software development lifecycle (SDLC) environments. Below are the most common challenges that organizations face when it comes to managing test data.
Test environment provisioning is a slow, manual, and high-touch process
Most IT organizations rely on a request-fulfill model, in which developers and testers find their requests queuing behind others. Because it takes significant time and effort to create a copy of test data, it can take days or even weeks to provision updated data for a test environment.
Often, the time to turn around a new environment is directly correlated to how many people are involved in the process. Enterprises typically have four or more administrators involved in setting up and provisioning data for a non-production environment. Not only does this process place a strain on operations teams, but it also creates time sinks during test cycles, slowing the pace of application delivery.
Development teams lack high-fidelity data
Development teams often lack access to test data that is fit for purpose. For example, depending on the release version being tested, a developer might require a data set as of a specific point in time. But all too often, developers are forced to work with a stale copy of data due to the complexity of refreshing an environment. This can result in lost productivity due to time spent resolving data-related issues, and it can increase the risk of data-related defects escaping into production.
Data masking adds friction to release cycles
For many applications, such as those processing credit card numbers, patient records, or other sensitive information, data masking is critical to ensuring regulatory compliance and protecting against data breaches. According to the Ponemon Institute, the cost of a data breach—including the costs of remediation, customer churn, and other losses—averages $3.92 million. However, masking sensitive data often adds operational overhead; an end-to-end masking process may take an entire week, which can prolong test cycles.
Storage costs are continually on the rise
IT organizations create multiple, redundant copies of test data, resulting in inefficient use of storage. To meet concurrent demands within the confines of storage capacity, operations teams must coordinate test data availability across multiple teams, applications, and release versions. As a result, development teams often contend for limited, shared environments, resulting in the serialization of critical application projects.
Common Types of Test Data
No single technology exists that fulfills all TDM requirements. Rather, teams must build an integrated solution that provides all the data types required to meet a diverse set of testing needs. Once test data requirements have been identified, a successful TDM approach should aim to provide the appropriate types of test data, weighing the pros and cons of each.
Production data provides the most complete test coverage, but it usually comes at the expense of agility and storage costs. For some applications, it can also mean exposing sensitive data.
Subsets of production data are significantly more agile than full copies. They can provide some savings on hardware, CPU, and licensing costs, but it can be difficult to achieve sufficient test coverage.
Masked production data (either full sets or subsets) makes it possible for development teams to use real data without introducing unsafe levels of risk. However, masking processes can elongate environment provisioning. Also, masking requires staging environments with additional storage and staff to ensure referential integrity after data is transformed.
Synthetic data circumvents security issues, but the space savings are limited. While synthetic data might be required to test new features, this is only a relatively small percentage of test cases. If performed manually, creating test data is also prone to human error and requires an in-depth understanding of data relationships, both those defined in the database schema or file system and those implicit in the data itself.
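As an illustration of scripted rather than manual synthetic data creation, the following is a minimal Python sketch. The `Customer` and `Order` shapes and the `generate_dataset` function are hypothetical names chosen for this example, not part of any TDM product. The sketch assumes a single explicit relationship (each order references a customer) and uses a seeded random generator so that the same data can be regenerated on demand.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # foreign key into the customer list
    amount_cents: int


def generate_dataset(n_customers: int, n_orders: int, seed: int = 0):
    """Build synthetic customers and orders with valid foreign keys."""
    rng = random.Random(seed)  # seeded so every run yields the same data
    customers = [
        Customer(customer_id=i, name=f"customer-{i:05d}")
        for i in range(1, n_customers + 1)
    ]
    orders = [
        Order(
            order_id=j,
            # Only reference customers that actually exist.
            customer_id=rng.choice(customers).customer_id,
            amount_cents=rng.randint(100, 500_000),
        )
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_dataset(n_customers=5, n_orders=20, seed=42)
known_ids = {c.customer_id for c in customers}
# Every order must point at a customer in the generated set.
assert all(o.customer_id in known_ids for o in orders)
print(len(customers), "customers,", len(orders), "orders")
```

Generating child rows only from existing parent keys keeps the explicit relationship intact. Relationships that are implicit in real data (for example, typical order sizes per customer segment) would still have to be modeled by hand.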
Best Practices for Test Data Management: How to Effectively Prepare Your Test Data
A comprehensive approach should seek to improve TDM in each of the following areas:
- Data delivery: reducing the time to deliver test data to a developer or tester
- Data quality: meeting requirements for high-fidelity test data
- Data security: minimizing security risks without compromising speed
- Infrastructure costs: lowering the costs of storing and archiving test data
The following sections highlight the top evaluative criteria for a TDM approach.
Creating a copy of production data for development or testing is often a time-consuming, labor-intensive process that usually lags demand. Organizations must build a solution that streamlines this process and creates a path towards fast, repeatable data delivery. Specifically, application team leaders should look for solutions that feature:
- Automation: Modern software toolsets already include technologies to automate build processes, infrastructure delivery, and testing, among other DevOps capabilities. However, organizations often lack equivalent tools for delivering copies of test data with the same level of automation. A streamlined TDM approach eliminates manual processes (such as target database initialization, configuration steps, and validation checks), providing a low-touch approach to standing up new data environments.
- Toolset integration: An efficient TDM approach should unite a heterogeneous set of technologies, including masking, subsetting, and synthetic data creation. This requires both compatibility across test data tools and exposed APIs (or other clear integration mechanisms to DevOps tools) to enable a factory-like approach to TDM.
- Self-service: Instead of relying on IT ticketing systems, an advanced TDM approach puts enough automation in place to let end users provision test data via self-service. Self-service capabilities should extend not just to data delivery, but also to control over test data versioning. For example, developers or testers should be able to bookmark and reset, archive, or share copies of test data without involving operations teams.
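The bookmark-and-reset behavior described in the self-service item can be pictured with a small in-memory sketch. `SandboxDataset` and its methods are hypothetical names used only for illustration. A real TDM tool would operate on databases or storage snapshots rather than Python lists, but the user-facing contract is similar: save a named state, run destructive tests, and return to that state without filing a ticket.

```python
import copy


class SandboxDataset:
    """Toy in-memory dataset with named bookmarks (hypothetical API)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._bookmarks = {}

    def bookmark(self, name):
        # Deep-copy so later edits cannot alter the saved state.
        self._bookmarks[name] = copy.deepcopy(self.rows)

    def reset(self, name):
        # Restore a fresh copy so the bookmark itself stays reusable.
        self.rows = copy.deepcopy(self._bookmarks[name])


data = SandboxDataset([{"id": 1, "status": "new"}])
data.bookmark("before-test")

# A destructive test mutates and extends the data...
data.rows[0]["status"] = "shipped"
data.rows.append({"id": 2, "status": "new"})

# ...and the tester rolls back without involving operations.
data.reset("before-test")
print(data.rows)
```

Copying on both `bookmark` and `reset` is a deliberate choice in this sketch: it keeps the saved state isolated from the working copy, so the same bookmark can be restored repeatedly between test cycles.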
Operations teams go through great efforts to make the right types of test data—such as masked production data or synthetic datasets—available to software development teams. As TDM teams balance requirements for different types of test data, they must also ensure data quality is preserved across three key dimensions:
- Data age: Due to the time and effort required to prepare test data, operations teams are often unable to keep up with ticket requests. As a result, data often becomes stale in non-production, which can impact the quality of testing and result in costly, late-stage errors. A TDM approach should aim to reduce the time it takes to refresh an environment, making the latest test data more accessible.
- Data accuracy: A TDM process can become challenging when multiple datasets are required as of a specific point-in-time for systems integration testing. For instance, testing a procure-to-pay process might require that data is federated across customer relationship management, inventory management, and financial applications. A TDM approach should allow for multiple datasets to be provisioned to the same point in time and simultaneously reset between test cycles.
- Data size: Due to storage constraints, developers must often work with subsets of data, which aren’t likely to satisfy all functional testing requirements. The use of subsets can result in missed test case outliers, which can paradoxically increase rather than decrease project costs due to data-related errors. In an optimized strategy, full-size test data copies can be provisioned in a fraction of the space of subsets by sharing common data blocks across copies. As a result, TDM teams can reduce the operational costs of subsetting—both in terms of data preparation and error resolution—by reducing the need to subset data as frequently.
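To make the data accuracy requirement above concrete, the sketch below shows one way to choose, for each source system, the snapshot to provision for a shared point in time. It assumes each system keeps a sorted list of snapshot timestamps. The function names and the example systems (`crm`, `inventory`, `finance`) are illustrative assumptions, not a specific tool's API.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(snapshot_times, target):
    """Return the latest snapshot time <= target, or None if none exists.

    snapshot_times must be sorted in ascending order.
    """
    idx = bisect_right(snapshot_times, target)
    return snapshot_times[idx - 1] if idx > 0 else None


def align_sources(snapshots_by_source, target):
    """Pick, for every source system, the snapshot to provision for target."""
    return {
        source: snapshot_as_of(times, target)
        for source, times in snapshots_by_source.items()
    }


snapshots = {
    "crm": [datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 6)],
    "inventory": [datetime(2024, 1, 1, 2), datetime(2024, 1, 1, 5)],
    "finance": [datetime(2024, 1, 1, 8)],
}
aligned = align_sources(snapshots, target=datetime(2024, 1, 1, 7))
for source, chosen in aligned.items():
    print(source, chosen)
```

A result of `None` for a source means it has no snapshot at or before the requested time. In that case an integration test across all systems cannot be provisioned consistently for that point in time.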
Masking tools have emerged as an effective and reliable method of protecting test data. By irreversibly replacing sensitive data with fictitious yet realistic values, masking can ensure regulatory compliance and completely neutralize the risk of data breach in test environments. But to make masking practical and effective, organizations should consider the following requirements:
- Complete solution: Many organizations fail to adequately mask test data because they lack a complete solution with out-of-the-box functionality to discover sensitive data and then audit the trail of masked data. In addition, an effective approach should mask data consistently while maintaining referential integrity across multiple, heterogeneous sources.
- No need for development expertise: Organizations should look for lightweight masking tools that can be set up without scripting or specialized development expertise. Tools with fast, predefined masking algorithms, for example, can dramatically reduce the complexity and resource requirements that stand in the way of consistently applying masking.
- Integrated masking and distribution: Only about one in four organizations uses masking tools, because of challenges delivering data downstream. To overcome this challenge, masking processes should be tightly coupled with a data-delivery mechanism. Many organizations will also benefit from an approach that allows them to mask data in a secure zone and then easily deliver that secure data to targets in non-production environments, including those in offsite data centers or in private or public clouds.
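One common way to mask data consistently while preserving referential integrity, as the requirements above call for, is deterministic keyed masking: the same input always maps to the same fictitious output, so masked keys still join across tables and sources. The sketch below is a simplified illustration using Python's standard `hmac` and `hashlib` modules. `mask_digits` and `SECRET_KEY` are hypothetical names, and this is not the algorithm of any particular masking product. It also does not guarantee unique outputs, so two different inputs could occasionally collide.

```python
import hashlib
import hmac

# Assumption: in practice the key is kept in the secure zone and never
# delivered to non-production environments.
SECRET_KEY = b"replace-with-a-secret-kept-out-of-test-environments"


def mask_digits(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace each ASCII digit in value with a keyed pseudo-random digit.

    The same input and key always yield the same output, so a masked
    identifier still matches across tables. Other characters (such as
    dashes) keep their positions, preserving the original format.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch in "0123456789":
            # Derive the replacement digit from the next digest byte.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


# The same card number appears in two hypothetical tables.
customers = [{"name": "A", "card": "4111-1111-1111-1111"}]
payments = [{"card": "4111-1111-1111-1111", "amount": 25}]

masked_customers = [{**c, "card": mask_digits(c["card"])} for c in customers]
masked_payments = [{**p, "card": mask_digits(p["card"])} for p in payments]

# The join key still matches after masking.
assert masked_customers[0]["card"] == masked_payments[0]["card"]
```

Because the mapping depends on a secret key rather than a lookup table, each source can be masked independently and still produce matching values, provided every masking job uses the same key.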
With the rapid proliferation of test data, TDM teams must build a toolset that maximizes the efficient use of infrastructure resources. Specifically, a TDM toolset should meet the following criteria:
- Data consolidation: It is not uncommon for organizations to maintain non-production environments in which 90% of the data is redundant. A TDM approach should aim to consolidate storage and slash costs by sharing common data across environments—including those used not only for testing, but also development, reporting, production support, and other use cases.
- Data archiving: A TDM approach should make it feasible to maintain libraries of test data by optimizing storage use and enabling fast retrieval. Data libraries should also be automatically version-controlled, much as tools like Git version code.
- Environment utilization: At most IT organizations, projects are serialized due to contention for environments. Paradoxically, those same environments are often underutilized because of the time it takes to populate them with the appropriate test data. A TDM solution should decouple data from blocks of computing resources through intelligent use of “bookmarking.” Bookmarked datasets, which can exist as of any point in time, can be loaded into environments on demand, making it easier for developers and testers to effectively time-share environments. As a result, an optimized TDM strategy can eliminate contention while achieving up to 50% higher utilization of environments.
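The data consolidation idea above, sharing common data across environments, can be illustrated with a toy content-addressed block store. This is a conceptual sketch only: `BlockStore` is a hypothetical class, and the 4096-byte block size is an arbitrary assumption. It shows how two nearly identical copies of a dataset can occupy little more physical space than one.

```python
import hashlib


class BlockStore:
    """Content-addressed store: identical blocks are kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # sha256 hex digest -> block bytes

    def put(self, data: bytes):
        """Store data and return the list of block hashes describing it."""
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)  # skip blocks already stored
            recipe.append(key)
        return recipe

    def get(self, recipe):
        """Reassemble a dataset from its list of block hashes."""
        return b"".join(self.blocks[k] for k in recipe)

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = BlockStore(block_size=4096)

# Ten distinct blocks stand in for a production copy.
base = b"".join(bytes([i]) * 4096 for i in range(10))
# A test copy that differs from the base in a single block.
test_copy = base[:3 * 4096] + b"\xff" * 4096 + base[4 * 4096:]

base_recipe = store.put(base)
copy_recipe = store.put(test_copy)

logical = len(base) + len(test_copy)
print("logical bytes:", logical)                  # 81920
print("physical bytes:", store.physical_bytes())  # 45056
assert store.get(copy_recipe) == test_copy
```

In this example the second copy adds only the one changed block, so the physical footprint grows by 4096 bytes rather than by the full 40960 bytes of the copy.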
The Modern Approach to Test Data Management
By building a Data Platform for TDM, companies can transform how they manage and consume data. IT operations teams can mask and deliver data one hundred times faster while using one-tenth of the space. The net result? More projects can be completed in less time using less infrastructure.
- Faster release cycles and time-to-market: 10 minutes to refresh an environment via self-service vs. 3.5 days
- Higher quality releases and reduced cost: 0% data-related defects vs. 15%
- Ensured data privacy and regulatory compliance: data secured in non-production
Explore how Delphix can help you test faster with greater confidence.
After reading this article you will be able to:
- Understand the current state of Test Data Management
- Learn about the common test data challenges
- Explore common types of test data
- Identify Test Data Management best practices
- Understand infrastructure costs