What is Data Masking?

Data masking, an umbrella term for data anonymization, pseudonymization, redaction, scrubbing, or de-identification, is a method of protecting sensitive data by replacing the original value with a fictitious but realistic equivalent. Data masking is also referred to as data obfuscation. 

Why is Data Masking Important? 

As IT leaders realize that data is key to building data-driven applications and software as well as unlocking competitive advantage, it’s becoming increasingly important to provide secure access to data that flows across an organization to innovate faster and at scale, without compromising privacy and security. 

The vast majority of sensitive data in an enterprise exists in non-production environments used for development and testing functions. Non-production environments represent the largest surface area of risk in an enterprise, with up to 12 non-production copies for every copy of production data that exists. To test adequately, realistic data is essential, but using real data creates considerable data security risks. 

Data masking also eliminates the risk of personal data exposure in compliance with data privacy regulations. By following data masking best practices, companies have the ability to move data fast to those who need it, when they need it. 

Common Methods of Data Masking

In-Place Masking: Reading from a target and then updating it with masked data, overwriting any sensitive information.

On the Fly Masking: Reading from a source (say production) and writing masked data into a target (usually non-production).

Static Data Masking: Masking data at rest in storage, which also removes any traces of the original values, such as logs or change data capture records.

Dynamic Data Masking: This technique temporarily hides or replaces sensitive data in transit, leaving the original at-rest data intact and unaltered. It is primarily used to apply role-based (object-level) security for databases or applications in production environments, and as a means to apply this security to (legacy) applications that don’t have a built-in, role-based security model. It protects data in read-only (reporting) scenarios. It’s not intended to permanently alter sensitive data values for use in non-production environments. 

Synthetic Data Generation: This technique does not mask data. It generates new data in lieu of existing data, keeping the data structure intact. It’s used for scenarios like greenfield application development.
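
To make the difference between these methods concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the rows are plain dictionaries standing in for database tables, and `mask_email`, `mask_on_the_fly`, `mask_in_place`, and `dynamic_view` are hypothetical helper names chosen for this example.

```python
import copy

def mask_email(email: str) -> str:
    """Hypothetical masking rule: keep the domain, replace the local part."""
    _, _, domain = email.partition("@")
    return "user@" + domain

def mask_on_the_fly(source_rows):
    """Read from a source and return a new, masked target; the source is untouched."""
    return [{**row, "email": mask_email(row["email"])} for row in source_rows]

def mask_in_place(target_rows):
    """Overwrite the sensitive values in the same rows that were read."""
    for row in target_rows:
        row["email"] = mask_email(row["email"])

def dynamic_view(rows, role):
    """Mask at read time based on the caller's role; stored rows are never modified."""
    if role == "admin":
        return [dict(row) for row in rows]
    return [{**row, "email": mask_email(row["email"])} for row in rows]

production = [{"id": 1, "email": "george@example.com"}]

# On the fly: production keeps the real value, the target only ever sees masked data.
non_production = mask_on_the_fly(production)

# In place: a copy is loaded first, then its sensitive values are overwritten.
test_copy = copy.deepcopy(production)
mask_in_place(test_copy)

# Dynamic: the stored data stays intact, and only the returned view is masked.
analyst_view = dynamic_view(production, "analyst")
```

In this sketch, only the in-place function changes stored data. The on-the-fly and dynamic functions both leave `production` unaltered, but the dynamic view is recomputed on every read rather than persisted to a target.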

Common Data Masking and Data Security Techniques

Encryption: This method scrambles data using mathematical calculations and algorithms. It’s best used for securing data that needs to be returned to its original value, e.g., production data or data in motion. Encryption only offers data protection as long as the corresponding encryption keys are safe. A hacker who compromises the right keys is able to decrypt sensitive data, restoring it back to its original state. With data masking, there is no master key and scrambled data cannot be returned to its original values.  

Tokenization: Tokenization is a variation of encryption that generates stateful or stateless tokens. In most cases, these can be re-identified.

Scrambling: This technique involves scrambling of characters or numbers, which does not properly secure sensitive data.

Nulling Out or Deletion: Changes data characteristics and removes any usefulness from the data.

Variance: The data is changed based on the ranges defined. It can be useful in certain situations, e.g., where transactional data that is non-sensitive needs to be protected for aggregations or analytical purposes.

Substitution: Data is substituted with another value. The level of difficulty to execute can range quite a bit. It’s the correct way to mask when done right.

Shuffling: Moving data within rows in the same column. This can be useful in certain scenarios, but data security is not guaranteed.

Redaction: This type of data masking changes all characters to the same character. It is easy to do, but the data loses its business value.
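
Several of the techniques above are simple enough to sketch in a few lines of Python. The following is a rough illustration under stated assumptions: the sample rows, the `name_lookup` table, and all function names are invented for this example, and the fixed random seed is there only to make the sketch repeatable.

```python
import random

def substitute(value, lookup):
    """Substitution: replace a value with another realistic value from a lookup table."""
    return lookup[value]

def shuffle_column(rows, column, rng):
    """Shuffling: move existing values between rows within the same column."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

def apply_variance(amount, pct, rng):
    """Variance: shift a numeric value by a random amount within +/- pct."""
    return round(amount * (1 + rng.uniform(-pct, pct)), 2)

def redact(value, char="X"):
    """Redaction: change every character to the same character."""
    return char * len(value)

def null_out(_value):
    """Nulling out: discard the value entirely."""
    return None

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = [
    {"name": "George", "salary": 50000.0},
    {"name": "Maria", "salary": 62000.0},
    {"name": "Chen", "salary": 71000.0},
]
# Hypothetical lookup table of replacement names.
name_lookup = {"George": "Elliot", "Maria": "Sofia", "Chen": "Lucas"}

substituted = [{**r, "name": substitute(r["name"], name_lookup)} for r in rows]
shuffled = shuffle_column(rows, "salary", rng)
varied = [apply_variance(r["salary"], 0.10, rng) for r in rows]
redacted = [redact(r["name"]) for r in rows]
```

The trade-offs described above show up directly in the sketch. Shuffling keeps every real salary in the table, so a value can still be traced back. Redaction and nulling out leave nothing useful for testing. Substitution yields realistic values, but how hard it is to do well depends on how the lookup is built and maintained.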

Requirements Your Data Masking Solution Should Fulfill 

1. Referential Integrity: Application development teams require fresh, full copies of the production database for their testing. True data masking techniques transform confidential information and preserve the integrity of the data. 

For example, George must always be masked to Elliot or a given social security number (SSN) must always be masked to the same SSN. This helps preserve primary and foreign keys in a database needed to evaluate, manipulate and integrate the datasets, along with the relationships within a given data environment as well as across multiple, heterogeneous datasets (e.g., preserving referential integrity when you mask data in an Oracle Database and a SQL Server database).

2. Realistic: Your data masking technology solution must give you the ability to generate realistic, but fictitious, business-specific data, so testing is feasible but provides zero value to thieves and hackers. The resulting masked values should be usable for non-production use cases. You can’t simply mask names into a random string of characters.

3. Irreversibility: The algorithms must be designed such that once data has been masked, you can’t back out the original values or reverse engineer the data.

4. Extensibility & flexibility: The number of data sources continues to grow at an accelerated rate. In order to enable a broad ecosystem and secure data across data sources, your data masking solution needs to work with the wide variety of data sources that businesses depend on and should be customizable.

5. Repeatable: Masking is not a one-time process. Organizations should perform data masking repeatedly as data changes over time. It needs to be fast and automatic while allowing integration with your workflows, such as SDLC or DevOps processes. 
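
One way to picture the referential integrity, realism, and repeatability requirements together is a deterministic masking function: the same input always yields the same realistic output, so keys still line up across tables and across runs. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library. This is an assumption made for illustration, not a description of any vendor's algorithm, and `SECRET`, `FIRST_NAMES`, `mask_name`, and `mask_ssn` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical secret and name pool; both are assumptions for illustration.
SECRET = b"rotate-and-store-this-outside-the-codebase"
FIRST_NAMES = ["Elliot", "Sofia", "Lucas", "Amara", "Noah", "Priya"]

def _digest(value: str) -> bytes:
    """Keyed hash, so the same input always produces the same bytes."""
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).digest()

def mask_name(name: str) -> str:
    """Map a name to a realistic name from the pool, deterministically."""
    index = int.from_bytes(_digest(name)[:8], "big") % len(FIRST_NAMES)
    return FIRST_NAMES[index]

def mask_ssn(ssn: str) -> str:
    """Map an SSN to a fictitious value in the same NNN-NN-NNNN format."""
    number = int.from_bytes(_digest(ssn)[:8], "big") % 10**9
    digits = f"{number:09d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

# Two tables, as if held in different databases, that share an SSN as the join key.
customers = [{"ssn": "123-45-6789", "name": "George"}]
orders = [{"ssn": "123-45-6789", "item": "laptop"}]

masked_customers = [
    {"ssn": mask_ssn(c["ssn"]), "name": mask_name(c["name"])} for c in customers
]
masked_orders = [{**o, "ssn": mask_ssn(o["ssn"])} for o in orders]

# The masked key is identical in both tables, so the join still works.
same_key = masked_customers[0]["ssn"] == masked_orders[0]["ssn"]
```

A few caveats apply to this sketch. Two different inputs can collide on the same masked value, so a real primary key would need a uniqueness guarantee on top. The digest cannot be inverted directly, but anyone holding the secret could enumerate a small input space such as SSNs, so irreversibility here depends on protecting or discarding that secret.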

Many data masking solutions add operational overhead and prolong test cycles for a company. But with an automated approach, teams can easily identify sensitive information such as names, email addresses, and payment information to provide an enterprise-wide view of risk and to pinpoint targets for masking. 

Unlike approaches that leverage encryption, masking not only ensures that transformed data is still usable in non-production environments, but also entails an irreversible process that prevents original data from being restored through decryption keys or other means.

With a policy-based approach, your data can be tokenized and reversed or irreversibly masked in accordance with internal standards and privacy regulations such as GDPR, CCPA, and HIPAA. Taken together, these capabilities allow businesses to define, manage, and apply security policies from a single point of control across large, complex data estates in real-time. 

Risk-Based Testing

The goal of any test data management (TDM) system is to shift testing left, reducing defects in production systems and keeping the business at optimal performance levels. Having the right TDM strategy is core to a successful DevOps strategy. Companies must be able to decide which option suits them best and then use the right toolset to extract the maximum business value from it. They should be able to tweak their release delivery pipelines based on the changes or new features introduced and execute faster cycles. The idea is to match the testing effort to the risk being introduced. 

What are the Benefits of Data Masking? 

The whole point of security is data confidentiality, where users can be assured of the privacy of their data. Masking done right can protect the content of data while preserving business value. There are different metrics to measure the degree of masking, the most common being k-anonymity, but any use of them should support shift-left testing so that data security and compliance are achieved.
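
For readers unfamiliar with k-anonymity: a dataset is k-anonymous when every combination of quasi-identifier values (attributes that could be combined to re-identify someone) is shared by at least k rows. The short Python sketch below computes k for a small, invented dataset. The records, column names, and the `k_anonymity` helper are assumptions made for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Invented, already-generalized records (truncated ZIP codes, age bands).
records = [
    {"zip": "940**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "940**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "940**", "age_band": "40-49", "diagnosis": "A"},
    {"zip": "940**", "age_band": "40-49", "diagnosis": "C"},
    {"zip": "941**", "age_band": "30-39", "diagnosis": "B"},
]

# The last record is alone in its group, so the dataset as a whole is only 1-anonymous.
k = k_anonymity(records, ["zip", "age_band"])
```

Here the single record in the "941**" group drags k down to 1. Generalizing or suppressing that record would raise k to 2 for the remaining rows.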

Unlike encryption measures that can be bypassed through schemes to obtain user credentials, masking irreversibly protects data in downstream environments. Consistent masking of data while maintaining referential integrity across heterogeneous data sources ensures the security of sensitive data before it is made available for development and testing, or sent to an offsite data center or the public cloud—all without the need for programming expertise. 

While there are plenty of masking technologies in the market, Delphix is the industry’s only API-driven data operations platform that combines data masking and data delivery. Download resources such as our “Protecting Sensitive Data with Data Masking” white paper for an in-depth look at data masking with the Delphix Data Platform. 


