Here’s How to Succeed with Securing Enterprise Data for Hundreds of Applications in Non-Production
Another week, another massive breach. This one exposed my personal data along with that of millions of others. We all realize the challenges of securing our data. Recent research demonstrates that it is not enough to anonymize a subset of the demographics for individuals. It's now common knowledge that someone's identity can be inferred using just a few pieces of real information, such as date of birth and postal code. And to top it all off, the digital universe of data continues to grow exponentially every year.
Data in non-production environments represents a significant percentage of total enterprise data volume, and it carries more risk than production environments because of the number of direct internal users. However, in the world of test data management (TDM), there are a number of ways enterprise teams can reduce their data risk footprint and achieve high-scale data transformation in non-production environments.
Removing Real Data From Non-Production at Scale is Hard
A company could easily have hundreds of schemas/applications, 500,000+ objects, and millions of columns. Yet when teams set out to remove real data from non-production environments, they typically start with a handful of applications, spending an average of $1 million just to secure 15-20 of them. So the question becomes: how can you succeed with your data security objectives in non-prod environments?
In the words of one customer, "All we're doing is throwing people at this problem, and we're not succeeding any more than we did last year."
Here are four best practices for integrating data security into test data management at scale:
1. Policy-Based Data Masking
Abstract how you transform data from how you read and write it. The policy can be based on regulations such as the GDPR, HIPAA, or LGPD, or customized for your company. It should identify what is sensitive and how that data should be protected, regardless of data source type or data residency. In other words, if you have multiple environments across data centers in disparate locations, the policy must be applied consistently to every one of those data sets. Without a data source-agnostic solution, it is incredibly difficult to secure data consistently, and with integrity, across a heterogeneous data topology.
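A minimal sketch of this separation, assuming a hypothetical policy table (the classifications, rule names, and replacement list below are illustrative, not from any specific product): the policy says what is sensitive and which rule applies, while the code that reads and writes rows knows nothing about the rules.

```python
import hashlib

# Hypothetical policy: column classification -> rule, independent of any data source.
MASKING_POLICY = {
    "first_name": {"sensitive": True, "rule": "lookup_name"},
    "email":      {"sensitive": True, "rule": "hash"},
    "order_id":   {"sensitive": False, "rule": None},
}

REPLACEMENT_NAMES = ["Mary", "John", "Aisha", "Kenji"]  # illustrative value set

def mask_value(column_class: str, value: str) -> str:
    """Apply the policy rule for a column classification, regardless of source."""
    rule = MASKING_POLICY.get(column_class, {}).get("rule")
    if rule is None:
        return value  # not sensitive: pass through unchanged
    if rule == "hash":
        return hashlib.sha256(value.encode()).hexdigest()[:12]
    if rule == "lookup_name":
        # Deterministic pick from the replacement list.
        idx = int(hashlib.sha256(value.encode()).hexdigest(), 16)
        return REPLACEMENT_NAMES[idx % len(REPLACEMENT_NAMES)]
    raise ValueError(f"unknown rule {rule!r}")

# The same policy applies whether rows come from Oracle, Postgres, or a CSV:
row = {"first_name": "Jennifer", "email": "jen@example.com", "order_id": "A-1001"}
masked = {k: mask_value(k, v) for k, v in row.items()}
```

Because the connector layer only calls `mask_value`, adding a new data source type requires no change to the policy or the rules.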
2. Transformations Based on Data Types, Not Data Instances
Initially, teams write masking algorithms fixed to a data instance such as first names, addresses, and national identifiers. But the moment the data you want in your test environment changes, the effort starts over from scratch. Instead, write a transformation for string/char data types, numeric data types, object data types, and date data types. Then set the transformation input to whatever you want your target data set to be, whether double-byte Japanese or single-byte German first names. Your algorithm should secure the data while distributing the target values randomly, consistently, and with good performance.
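One way to sketch this: a single transform for the string data type, parameterized by the target value set (the name lists below are assumptions for illustration). Swapping German for Japanese names changes an input, not the algorithm.

```python
import hashlib

def masked_index(value: str, n: int) -> int:
    """Stable index in [0, n) derived from the input value."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % n

def make_string_transform(replacement_values: list):
    """One transform for all string columns; the target value set is an input."""
    def transform(value: str) -> str:
        return replacement_values[masked_index(value, len(replacement_values))]
    return transform

# Illustrative target value sets (assumed, not from the article):
german_first_names = ["Hans", "Greta", "Lukas", "Mia"]
japanese_first_names = ["春人", "さくら", "大輝", "美咲"]

mask_de = make_string_transform(german_first_names)
mask_jp = make_string_transform(japanese_first_names)
```

When the test data requirements change, you build a new transform from a new value set; the type-level algorithm, and any testing already done on it, is reused as-is.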
3. Deterministic Masking
Establish the output based on the input. Integrated data sets are the gold standard in test data environments, so whatever algorithms or scripts you write, ensure the same input always produces the same output. The same applies to dates of birth: if age matters to your tests, the algorithm should preserve it, because that data has value. When you achieve this, the same Jennifer in Hong Kong, New York, AWS, or on-prem always ends up as Mary, consistently across all data sets, sources, and environments.
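A common way to get this behavior is a keyed hash: a sketch, assuming every environment shares the same secret key (the key and name list are illustrative). The same input always yields the same output, and the birth-date transform keeps the year so age survives masking.

```python
import hmac
import hashlib
from datetime import date, timedelta

# Assumed setup: this key is shared by every environment, which is what makes
# the outputs consistent across data sets, sources, and clouds.
SECRET_KEY = b"shared-masking-key"

def deterministic_pick(value: str, choices: list) -> str:
    """Same input value always maps to the same replacement."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return choices[int(digest, 16) % len(choices)]

def mask_birth_date(dob: date) -> date:
    """Move the birthday deterministically but keep the year, preserving age."""
    digest = hmac.new(SECRET_KEY, dob.isoformat().encode(), hashlib.sha256).hexdigest()
    offset = int(digest, 16) % 365  # day-of-year offset within the same year
    return date(dob.year, 1, 1) + timedelta(days=offset)

# The same Jennifer ends up as the same masked name everywhere:
names = ["Mary", "Linda", "Susan"]
masked_name = deterministic_pick("Jennifer", names)
```

Because the mapping is derived from the input rather than stored in a lookup table, referential integrity holds across data sets that were masked independently.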
4. Virtualization > Subsetting
Subsetting allows for volume reduction. While it's feasible to subset small numbers of data sets, such as a dozen or so applications, you'll end up with a programming effort that takes weeks for each target application. Multiply that effort by hundreds of applications, and it becomes clear how hard subsetting is in real life. Instead, virtualize your data to deliver terabytes in minutes using tiny quantities of storage. Combining virtualization with data masking provides the best of both worlds: the data is delivered in minutes, without the storage cost, already secured.
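The storage economics of virtualization come from copy-on-write: a toy sketch (not any vendor's implementation) where each virtual copy shares the read-only base data set and stores only its own changes.

```python
class VirtualCopy:
    """Toy copy-on-write clone: shares the base data set, stores only deltas."""

    def __init__(self, base: dict):
        self._base = base    # shared, read-only base; never duplicated per clone
        self._delta = {}     # this clone's overrides only

    def read(self, key):
        # A clone's own change wins; otherwise fall through to the shared base.
        return self._delta.get(key, self._base.get(key))

    def write(self, key, value):
        self._delta[key] = value  # storage grows with changes, not base size

base = {"row1": "Alice", "row2": "Bob"}  # imagine terabytes here
clone_a = VirtualCopy(base)
clone_b = VirtualCopy(base)
clone_a.write("row1", "Mary")  # only clone_a sees this change
```

Each environment gets a full logical copy in constant time, while physical storage grows only with the deltas each clone writes.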
The Future of Test Data Management
There has been an explosion of new data sources in the last decade. We have seen a big shift from commercial OLTP databases to open source data sources with commercial support. Then the public cloud providers changed how these data sources are used and managed. The latest real-time, event-based data sources, such as Kafka/Confluent, continue to push the envelope on how we process data, what we call a database, and how much data we can process in real time.
Non-production databases tend to be at the forefront of this movement, since that is where new development and testing take place. We expect the enterprise data sources containing the most sensitive data, such as PII, PHI, and PCI, to remain in OLTP and mainframe environments under commercial and open source licensing models, whether on-prem or in the cloud. These are still the environments that source the data for large BI/OLAP environments. However, the expansion of real-time data and the focus on APIs have taken root. As a result, we expect the TDM market to shift from the database to the network, managing the security of data at API endpoints as well as Kafka topics.