Here’s How to Succeed with Securing Enterprise Data for Hundreds of Applications in Non-Production

Discover how your enterprise teams can succeed with data security and achieve high-scale transformation in non-prod environments, ultimately boosting your efforts with test data management.

Another week, another massive breach. This one includes my personal data along with that of millions of others. We all recognize how hard it is to secure our data. Recent research demonstrates that anonymizing a subset of an individual’s demographics is not enough. It’s now common knowledge that someone’s identity can be inferred from just a few pieces of real information, such as date of birth and postal code. To top it all off, the digital universe of data continues to grow exponentially every year.

Data in non-production environments represents a significant percentage of total enterprise data volume, and it also carries more risk than production environments because of the number of direct internal users. However, there are a number of ways enterprise teams can reduce their data risk footprint and achieve high-scale data transformation in their non-prod environments in the world of test data management (TDM). 

Removing Real Data From Non-Production at Scale is Hard 

A company could easily have hundreds of schemas/applications, 500,000+ objects, and millions of columns. But when they think about removing real data from non-production environments, they’ll start with a handful of applications, spending on average $1 million just to secure 15-20 applications. So the question becomes, how can you succeed with your data security objectives in non-prod environments?

In the words of one customer, “All we’re doing is throwing people at this problem, and we’re not succeeding any more than we did last year.”

Here are 4 best practices for integrating data security into test data management at scale:

1. Policy-Based Data Masking

Abstract how you transform data from how you read and write data. The policy can be based on regulations such as the GDPR, HIPAA, or LGPD, or customized for your company. It should indicate what is sensitive and how the data should be protected, regardless of the data source type or data residency. In other words, if you have multiple environments across data centers in disparate locations, masking must be consistent across every one of those data sets. Without a data-source-agnostic solution, it is incredibly difficult to secure data consistently, and with integrity, across a heterogeneous data topology.
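
One way to picture this separation is the minimal Python sketch below. The field names, the masking rules, and the two stand-in readers are all hypothetical and chosen only for illustration. The point is that the policy is defined once and applied unchanged to rows from any source.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]
Policy = Dict[str, Callable[[str], str]]

# Hypothetical policy: which fields are sensitive and how each is protected.
# The field names and rules are illustrative, not drawn from any regulation.
POLICY: Policy = {
    "email": lambda value: "masked@example.com",
    "national_id": lambda value: "*" * len(value),
}


def apply_policy(rows: Iterable[Row], policy: Policy = POLICY) -> Iterator[Row]:
    """Mask every field named in the policy; pass other fields through."""
    for row in rows:
        yield {k: policy[k](v) if k in policy else v for k, v in row.items()}


# Stand-in readers; real ones might wrap a database cursor or a file.
def read_crm() -> Iterator[Row]:
    yield {"plan": "gold", "email": "ana@corp.test", "national_id": "123456789"}


def read_billing() -> Iterator[Row]:
    yield {"invoice": "42", "email": "ana@corp.test"}


# The same policy object handles both sources without knowing how they are read.
masked_crm = list(apply_policy(read_crm()))
masked_billing = list(apply_policy(read_billing()))
```

Because the readers know nothing about masking and the policy knows nothing about the sources, adding a new source in this sketch means writing a new reader only.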

2. Transformations Hinged on Data Instances

Initially, teams write masking algorithms fixed to a data instance, such as first names, addresses, or national identifiers. But the moment the data you want in your test environment changes, you have to start over. Instead, write a transformation for each data type: string/char, numeric, object, and date. Then set the transformation input to what you want your target data set to be, whether that’s double-byte Japanese or single-byte German first names. Your algorithm should take care of securing the data and of distributing the target values randomly, consistently, and with good performance.
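
As a rough illustration of type-level transformations, the Python sketch below defines one function per data type and takes the target value set as an input. Everything named here (the salt, the function names, the sample name lists) is an assumption made for the example. A per-value seeded generator from the standard `random` module stands in for a real masking algorithm; it gives repeatable results but is not cryptographically secure.

```python
import random
from datetime import date, timedelta
from typing import Sequence

SALT = "demo-salt"  # hypothetical; a real deployment would protect this value


def _rng(value: object) -> random.Random:
    """Build a generator seeded from the input, so results repeat per value."""
    # A string seed is converted to an integer deterministically.
    return random.Random(f"{SALT}:{value!r}")


def mask_string(value: str, targets: Sequence[str]) -> str:
    """Swap any string for an entry drawn from the chosen target set."""
    return _rng(value).choice(targets)


def mask_number(value: int, low: int, high: int) -> int:
    """Swap a number for another in the inclusive range [low, high]."""
    return _rng(value).randint(low, high)


def mask_date(value: date, window_days: int = 30) -> date:
    """Shift a date by a repeatable offset of at most window_days."""
    shift = _rng(value).randint(-window_days, window_days)
    return value + timedelta(days=shift)


# Illustrative target sets; changing the target means changing only this input.
GERMAN_FIRST_NAMES = ["Anna", "Lukas", "Marie", "Felix"]
JAPANESE_FIRST_NAMES = ["陽菜", "蓮", "結衣", "大翔"]

masked_de = mask_string("Jennifer", GERMAN_FIRST_NAMES)
masked_ja = mask_string("Jennifer", JAPANESE_FIRST_NAMES)
```

The string transformation never mentions first names, addresses, or identifiers. Switching from German to Japanese output is a change of argument, not a rewrite.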

3. Deterministic Masking

Establish the output based on the input. Integrated data sets are the gold standard in test data environments. Whatever algorithms or scripts you’re writing, ensure they produce the same output for the same input. The same is true of date of birth: if you’re going to keep someone’s age the same, make sure the algorithm preserves that age, because data has value. When you’re able to do that, the same Jennifer in Hong Kong, New York, AWS, or on-prem will always end up as Mary, consistently across all data sets, sources, and environments.
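
A minimal Python sketch of deterministic masking follows, assuming a keyed hash (HMAC-SHA256) is used to pick the replacement. The key, the name list, and the function names are hypothetical. Which replacement a given input maps to depends on the key and the list, so the sketch does not promise that Jennifer becomes Mary, only that she becomes the same name everywhere. The date-of-birth function picks a repeatable replacement date that yields the same age on a chosen reference date.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"demo-key-not-for-real-use"  # hypothetical shared masking key
NAMES = ["Mary", "Linda", "Susan", "Karen"]  # illustrative replacement set


def _stable_int(value: str, key: bytes = SECRET_KEY) -> int:
    """Derive a repeatable integer from the input using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def mask_name(name: str) -> str:
    """Same input name, same key: same replacement, on any source."""
    return NAMES[_stable_int(name) % len(NAMES)]


def age_on(dob: date, ref: date) -> int:
    """Age in whole years on the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))


def mask_dob(dob: date, ref: date) -> date:
    """Pick a repeatable replacement date with the same age on `ref`."""
    age = age_on(dob, ref)
    try:
        # Latest birth date that still gives `age` on the reference date.
        latest = ref.replace(year=ref.year - age)
    except ValueError:  # ref is Feb 29 and the target year is not a leap year
        latest = date(ref.year - age, 2, 28)
    # Any date up to 364 days earlier has the same age on `ref`.
    offset = _stable_int(dob.isoformat()) % 365
    return latest - timedelta(days=offset)


# Rows from two stand-in sources mask to the same value.
hong_kong = [{"first_name": "Jennifer"}]
new_york = [{"first_name": "Jennifer"}]
masked_hk = [mask_name(r["first_name"]) for r in hong_kong]
masked_ny = [mask_name(r["first_name"]) for r in new_york]
```

Using a keyed hash rather than a lookup table means no mapping has to be stored or synchronized between environments in this sketch; sharing the key is enough to reproduce the same outputs.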

4. Virtualization > Subsetting   

Subsetting allows for volume reduction. While it’s feasible for a small number of data sets, such as a dozen or so applications, you’ll end up with a programming effort that takes weeks for each target application. Multiply that effort by hundreds of applications, and it becomes clear how hard subsetting is in real life. Instead, virtualize your data to deliver terabytes of data in minutes using tiny quantities of storage. Combining virtualization with data masking provides the best of both worlds: the data is delivered in minutes, without the storage cost, and already secured.

The Future of Test Data Management 

There has been an explosion of new data sources in the last decade. We have seen a big shift from commercial OLTP databases to open source data sources with commercial support. Public cloud providers have since changed how these data sources are used and managed. The latest real-time, event-based data sources, such as Kafka/Confluent, continue to push the envelope on how we process data, what we call a database, and how much data we can process in real time.

Non-production databases tend to be at the forefront of this movement, since new development and testing take place there. We expect the enterprise data sources that contain the most sensitive data, such as PII, PHI, and PCI, to remain in OLTP and mainframe environments under commercial and open source licensing models, whether on-prem or in the cloud. These are still the environments that source the data for large BI/OLAP environments. However, the expansion of real-time data and the focus on APIs have really taken root. Given that, we expect the TDM market to shift from the database to the network, managing the security of that data at API endpoints as well as in Kafka topics.
