De-identification with Healthcare Data - What the Salesperson Can’t Tell You!

Learn the one secret a salesperson can’t tell you when it comes to de-identifying your sensitive data.

Ilker Taskaya, Jeannine Crownover

Nov 06, 2018

Are you in the midst of evaluating a de-identification or data masking product? If so, you are well aware of the industry mandates that restrict the use of customer or patient personally identifiable information (PII) in non-production software and testing environments. The requirement is spreading beyond HIPAA compliance with every new data security regulation, such as the New York DFS Cybersecurity Regulation or the more recent California Consumer Privacy Act.

You’ve also narrowed the search down to two or three vendors, and the salesperson for each product is keen to help you evaluate their tool by addressing your key evaluation points, such as the de-identification of PII. But one thing he or she cannot tell you is how to de-identify data in non-production systems and still keep its business value. That will fall entirely on your shoulders.

Take, for example, healthcare organizations, which are tightly governed by privacy laws that restrict access to and distribution of patients’ and members’ electronic medical records (EMR) and protected health information (PHI). All of this information is needed for patient care and insurance claims processing, but it is not necessary for the design, development and testing of a company's technology systems.

So how can you ensure patient privacy and regulatory compliance while allowing for technical flexibility and still keeping business value?

Automatic Referential Integrity: The Holy Grail

Data must be de-identified consistently to preserve the important relationships in the data. This is especially important because healthcare data is heavily interrelated: one field often implies or constrains another, as in clinical edits. For example, the age of a person determines whether he or she is likely to suffer from a specific condition, e.g., a teenager would not be diagnosed with Alzheimer's. Similarly, the gender of a person determines what type of treatment he or she receives, e.g., a maternity service is expected only for a female patient.

The more broadly you apply data masking with referential integrity, the more value you should gain in generating test data. If you are building your test data in silos, you will end up with incompatible data sets.

The best way to achieve this outcome is a product architecture that abstracts the de-identification of the data from the data source. If you are de-identifying an ICD-10 code, the same production input should always be masked to exactly the same output across different types of data sources, such as mainframes, databases or EDI X12 files. In some cases, this needs to extend to applications in different on-premises data centers or in public clouds, such as AWS or Microsoft Azure.
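As a rough illustration of source-independent masking, here is a minimal Python sketch of one common approach, deterministic keyed hashing. It is not a description of how any particular product works. The key, the replacement pool and the `mask_code` function are all hypothetical, and the sketch assumes that some sources store the code with a decimal point and others without.

```python
import hashlib
import hmac

# Hypothetical secret for illustration only; a real key would live in a key vault.
SECRET_KEY = b"example-masking-key"

# Hypothetical pool of substitute ICD-10 codes, chosen arbitrarily for the sketch.
REPLACEMENT_CODES = ["E11.9", "I10", "J45.909", "K21.9", "N39.0"]


def mask_code(value: str) -> str:
    """Map a code to a substitute that depends only on the key and the code."""
    # Normalize first, so formatting differences between sources
    # (case, whitespace, decimal point) do not change the result.
    normalized = value.strip().upper().replace(".", "")
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_CODES)
    return REPLACEMENT_CODES[index]


# The same production value arriving from two differently formatted sources.
from_database = mask_code("E11.9")
from_edi_file = mask_code(" e119 ")
print(from_database == from_edi_file)
```

Because the output depends only on the key and the normalized input, every source that runs the same function with the same key stays consistent without sharing a lookup table. With a pool this small, different inputs will collide on the same substitute, so a real rule set would need a far larger pool or a one-to-one mapping.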

Realistic but Fictitious Test Data: The ‘Empty Purse’

The resulting de-identified values must be usable for non-production use cases, including development and testing. You can’t simply de-identify names into a random string of characters, nor can you randomly move names within the same column. Some names are well known, such as ‘Gates,’ or so unique that only a few people within a given state have them. At the same time, to retain business value, the generated test data should protect against inference while maintaining the semantics required for testing.

Date of birth is a great example of this. It’s the second most commonly used identification mechanism after someone’s name. The de-identification mechanism should change the birth date but still retain the patient’s age as of the date of masking, so that the data keeps its business value for age-dependent testing.
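One way to picture the date-of-birth rule is the following minimal Python sketch, which moves a birth date to another day within the one-year window that yields the same age on the masking date. The key and the names `age_on` and `mask_birth_date` are hypothetical, and the sketch assumes age is counted in completed years.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret for illustration only.
SECRET_KEY = b"example-masking-key"


def age_on(dob: date, as_of: date) -> int:
    """Completed years between dob and as_of."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


def _anniversary(year: int, month: int, day: int) -> date:
    # Feb 29 does not exist in non-leap years; fall back to Feb 28.
    try:
        return date(year, month, day)
    except ValueError:
        return date(year, month, day - 1)


def mask_birth_date(patient_id: str, dob: date, as_of: date) -> date:
    """Pick a repeatable date in the window that gives the same age on as_of."""
    if dob > as_of:
        raise ValueError("birth date is after the masking date")
    age = age_on(dob, as_of)
    # Latest birth date that still gives this age: the birthday falls on as_of.
    latest = _anniversary(as_of.year - age, as_of.month, as_of.day)
    # Earliest: the day after the same anniversary one year earlier.
    earliest = _anniversary(as_of.year - age - 1, as_of.month, as_of.day) + timedelta(days=1)
    span = (latest - earliest).days + 1
    message = f"{patient_id}|{dob.isoformat()}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return earliest + timedelta(days=int.from_bytes(digest[:8], "big") % span)


masking_date = date(2018, 11, 6)
original = date(1950, 3, 15)
masked = mask_birth_date("patient-001", original, masking_date)
print(age_on(original, masking_date), age_on(masked, masking_date))
```

The masked date is repeatable for the same patient and key, which keeps it consistent across data sets. The sketch can occasionally return the original date, and the age alone may still be identifying for very old patients, so a real rule would need extra handling for both cases.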

There you have it folks - some of the most important points to consider before investing in a data masking solution. The last thing you want is a rigid security tool that doesn’t align with your workflow and generates useless test data.

The Delphix de-identification technology allows healthcare organizations to leverage useful data without compromising privacy. No technique reduces the risk to zero, but risk can be greatly reduced without sacrificing the value of the data to the business.

Learn more about how you can safeguard confidential data through the Delphix Data Platform, which provides an enterprise-wide approach to data de-identification and data virtualization to help you sync, mask and deliver your data securely and rapidly.
