Introduction to Secure Lookup, a Data Masking Algorithm
Modern computer systems contain a great deal of sensitive data. Think about the sort of data your bank or health insurance company has stored in their systems. These businesses have everything from financial transactions to medical diagnoses. Meanwhile, developers and test engineers working to enhance and test the software backing these systems need realistic data to test with.
In the past, it was acceptable to either synthesize data or give developers a copy of production data. Today, systems are more complex and process more sensitive information, so neither data synthesis nor direct use of production data is acceptable. Synthetic data is usually not very realistic, which means it doesn’t provide high-quality test coverage, and production data contains information that cannot be shared with developers and testers for legal and moral reasons.
The Delphix Masking Engine provides a suite of algorithms, such as secure lookup, to mask sensitive data. Unlike traditional encryption, most masking algorithms are designed to be irreversible, meaning they purposely destroy information so the original data is not retrievable from the masked dataset. Secure lookup is designed to mask data consistently but irreversibly.
Masking Consistently and Irreversibly
The secure lookup algorithm takes the input string (i.e., the unmasked sensitive data), applies a hash function and uses the result of that hash to index into a list of possible output values (as shown in Diagram 1). This algorithm is configurable in two important ways: the output list and the hash function.
While the Delphix Masking Engine includes several pre-configured secure lookup algorithms that include an output value list, the best practice is for customers to provide their own list so it’s unique to each deployment. A custom list also allows the output to be tailored to match application and user expectations. For example, a masking algorithm used to mask names in a database at a Spanish business can be configured to produce Spanish first and last names in the masked dataset.
The hash function depends on a seed value, a random number that the hash function takes as one of its inputs and that the user can change. Changing the seed changes the mapping between input values and output values. If they wish, customers can periodically rotate the algorithm seeds, much as they would rotate authentication keys or encryption certificates.
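The lookup described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the article does not say which hash function the Delphix Masking Engine uses, so HMAC-SHA256 keyed by the seed is an assumption, and the function name `secure_lookup`, the sample list, and the seed values are hypothetical.

```python
import hashlib
import hmac


def secure_lookup(value: str, output_list: list[str], seed: bytes) -> str:
    """Map an input string to an entry of output_list via a seeded hash.

    Assumption: HMAC-SHA256 stands in for the engine's hash function,
    which the article does not specify.
    """
    digest = hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()
    # Interpret the digest as an integer and reduce it to a list index.
    index = int.from_bytes(digest, "big") % len(output_list)
    return output_list[index]


# Hypothetical output list and seeds, for illustration only.
spanish_names = ["Alejandro", "Lucía", "Mateo", "Sofía", "Diego"]
seed_a = b"seed-one"
seed_b = b"seed-two"

masked = secure_lookup("Tom", spanish_names, seed_a)
# The same input, list, and seed always give the same output.
assert masked == secure_lookup("Tom", spanish_names, seed_a)
# A new seed changes the overall mapping, although any single input
# may still happen to land on the same entry.
print(masked, secure_lookup("Tom", spanish_names, seed_b))
```

The hash value itself is never returned or stored; only the list entry it selects leaves the function.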
The security of secure lookup comes from two properties. The first is simply that the sensitive data is overwritten with fake data. The second is that the mapping from inputs to outputs is deliberately not one-to-one, so a masked value cannot be traced back to a single original. Because secure lookup uses a hash to determine how to mask each value, it also masks consistently, meaning that a given input string will always produce the same output unless the seed or the list is changed.
For instance, if “Tom” is masked to “Mary” in one usage of the algorithm, it will mask to “Mary” in every other usage of the algorithm. It’s important that we not generate a one-to-one mapping from inputs to outputs; mathematically, we want to avoid a one-to-one, or bijective, masking function, because the algorithm would be reversible if the seed and the output list were compromised.
Secure lookup achieves this by applying the pigeonhole principle. If the algorithm is configured with an output list that is smaller than the number of unique values in the unmasked data, then the algorithm is guaranteed to generate collisions and thus cannot produce a one-to-one mapping.
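The pigeonhole argument can be checked directly. The sketch below is illustrative only: it assumes HMAC-SHA256 as a stand-in for the engine's unspecified hash, and the names, list, and seed are made up. With eight distinct inputs and three outputs, at least one output must be shared by three or more inputs, whatever the hash does.

```python
import hashlib
import hmac


def secure_lookup(value, output_list, seed):
    # Assumption: HMAC-SHA256 stands in for the engine's unspecified hash.
    digest = hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()
    return output_list[int.from_bytes(digest, "big") % len(output_list)]


# Hypothetical data: eight distinct inputs, three possible outputs.
inputs = ["Tom", "Nate", "Ana", "Raj", "Mei", "Omar", "Ines", "Lars"]
outputs = ["Mary", "John", "Alex"]

mapping = {name: secure_lookup(name, outputs, b"example-seed") for name in inputs}

# Group the inputs by the masked value they share.
preimages = {}
for name, masked in mapping.items():
    preimages.setdefault(masked, []).append(name)

# With more inputs than outputs, some masked value has several possible
# originals, so the mapping cannot be inverted uniquely.
largest = max(len(names) for names in preimages.values())
print(preimages)
print("largest group:", largest)
```

Even someone holding both the seed and the output list can only narrow a masked value down to its group of possible originals, not to a single one.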
When talking about hash functions in a security context, algorithms like MD5 and SHA-256 are often discussed. Unlike most uses of those algorithms, we aren’t exposing the output of the hash function but rather we’re using it to index into a table of values. It doesn’t really matter whether the output of our hash function can be reverse engineered or what the probability of a collision is. We don’t expose the hash value, or even store it, and collisions are one of the things we depend on for security.
Securing Secure Lookups
We commonly get questions from users asking why an input value sometimes masks to itself. If the output list contains values that are also in the input data, then it’s possible for the hash of “Nate” to index to the “Nate” entry in the output list. A common misconception is that this is insecure. However, if our algorithm avoided such “self-mappings,” it would, in fact, be less secure.
An attacker looking at the masked data will not have any way to know that the unmasked value for “Nate” happens to, by coincidence, be “Nate.” However, if the algorithm worked to avoid self-mappings, then an attacker who saw “Nate” in the output data could conclude that the original value for that name was NOT “Nate.” Avoiding self-mappings leaks information, whereas allowing them does not.
Combined with other information in the masked dataset or from external sources, some of which might be unmasked, an attacker might be able to use the process of elimination to infer more information about the data. We don’t want to give attackers any clues.
As a simple example, suppose our input set has only two values in a particular column, ‘TRUE’ and ‘FALSE,’ and we set up a secure lookup algorithm to mask this column. The application that will consume the data likely expects to find only ‘TRUE’ and ‘FALSE’ in that column, so our output list will include only those two values. If we use a version of secure lookup that avoids self-mapping, there is only one possible mapping: every ‘TRUE’ becomes ‘FALSE’ and every ‘FALSE’ becomes ‘TRUE.’
Such an algorithm effectively reveals the entire dataset to anyone who knows that it was masked with an algorithm that avoids self-mapping. Compare that result with the actual Delphix secure lookup algorithm, which allows self-mapping. Now an attacker cannot tell which of the possible mappings was used.
Of course, a secure lookup algorithm with only two values isn’t very secure in any case. But if the algorithm prevents self-mappings, then we’re telling attackers that the value in the masked data set is certainly not the real value, which is more than they know if we allow self-mapping.
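The TRUE/FALSE example is small enough to enumerate. The sketch below lists every way of assigning an output to each input, then filters out the assignments in which any value masks to itself; the helper `candidate_originals` is a hypothetical name introduced here to show what an attacker can infer from a single masked value under each rule.

```python
from itertools import product

values = ["TRUE", "FALSE"]


def candidate_originals(mappings, observed):
    """Originals that could have produced `observed` under any allowed mapping."""
    return {
        original
        for m in mappings
        for original, masked in m.items()
        if masked == observed
    }


# Every way to assign an output to each input.
all_mappings = [dict(zip(values, outs)) for outs in product(values, repeat=len(values))]
# The hypothetical variant that forbids any value from masking to itself.
no_self_mappings = [m for m in all_mappings if all(k != v for k, v in m.items())]

print(no_self_mappings)  # → [{'TRUE': 'FALSE', 'FALSE': 'TRUE'}]
print(candidate_originals(no_self_mappings, "TRUE"))  # → {'FALSE'}
print(candidate_originals(all_mappings, "TRUE"))
```

When self-mappings are forbidden, a masked ‘TRUE’ has exactly one possible original. When they are allowed, both originals remain possible, so the masked value alone tells the attacker nothing.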
Bigger is better, right?
We often see customers trying to use enormous output lists for secure lookup. While our product can handle reasonably large lists, its capacity is finite. Thinking about the security properties discussed earlier, the larger the output list is, the less information the algorithm is destroying.
Let’s consider a simple example. Suppose we’re data masking a column in a dataset that contains 10 unique values, and we mask it using secure lookup lists of size 5, 10 and 50. The results might look like this, and in fact these were generated by the Delphix Masking Engine.
The algorithm with only 5 entries has generated many duplicates, while the algorithm with 50 entries has generated a unique mapping for each input value. Imagine an attacker who has a copy of the masked data and has access to the Delphix Masking Engine that generated it, or an identically configured one. Employees with this kind of access exist within many of the companies that use our product.
The person running the masking job is not necessarily permitted to see the unmasked data. In the 50-entry case, the attacker only needs to create a masking job using that algorithm and give it the first names of some of their coworkers or their company’s customers to re-identify the data. In the 5-entry case, where there are many duplicates, having access to the masking engine does not give the attacker an easy path to identifying which rows in the database belong to which person.
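The attack described here amounts to masking a list of guessed names and seeing which guesses end up alone on a masked value. The sketch below is a rough simulation under stated assumptions: HMAC-SHA256 stands in for the engine's unspecified hash, and the coworker names, seed, and placeholder output lists are invented for illustration.

```python
import hashlib
import hmac


def secure_lookup(value, output_list, seed):
    # Assumption: HMAC-SHA256 stands in for the engine's unspecified hash.
    digest = hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()
    return output_list[int.from_bytes(digest, "big") % len(output_list)]


# Hypothetical guesses an insider might feed to the engine.
seed = b"example-seed"
coworkers = ["Ana", "Ben", "Cai", "Dee", "Eli", "Fay", "Gus", "Hal", "Ivy", "Jon"]


def candidates_by_masked_value(names, list_size):
    """Mask every guessed name and group the guesses by their masked value."""
    output_list = [f"Name{i:02d}" for i in range(list_size)]
    groups = {}
    for name in names:
        groups.setdefault(secure_lookup(name, output_list, seed), []).append(name)
    return groups


for size in (5, 50):
    groups = candidates_by_masked_value(coworkers, size)
    # A guess is pinned down when no other guess shares its masked value.
    pinned = sum(1 for g in groups.values() if len(g) == 1)
    print(f"list size {size}: {pinned} of {len(coworkers)} guesses map to a masked value of their own")
```

With ten guesses and five outputs, some masked value must cover at least two guesses, so the attacker is always left with ambiguity. With fifty outputs, most guesses typically sit alone on their masked value.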
Having duplicate mappings also works to thwart frequency-based attacks. If an attacker is trying to identify the data for someone with a rare value, a one-to-one mapping helps them. For example, suppose an attacker is looking in a masked dataset for a person with an uncommon name like “Throckmorton.”
It’s very likely that only one person in the dataset has that name, so the attacker can narrow down the amount of data they have to look at to only those entries with unique masked names. If the data is masked with an algorithm that produces duplicates, like secure lookup, it’s likely that “Throckmorton” will share its masked value with a more common name like “Smith” or “Jones,” thus disguising Throckmorton’s data in the masked dataset.
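The frequency argument can be illustrated by comparing value counts before and after masking. In this sketch the column contents, the one-to-one substitution table, and the two-entry output list are all hypothetical, and HMAC-SHA256 is again an assumed stand-in for the engine's hash.

```python
from collections import Counter
import hashlib
import hmac


def secure_lookup(value, output_list, seed):
    # Assumption: HMAC-SHA256 stands in for the engine's unspecified hash.
    digest = hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()
    return output_list[int.from_bytes(digest, "big") % len(output_list)]


# Hypothetical column: three common names plus one rare name.
column = ["Smith"] * 6 + ["Jones"] * 5 + ["Garcia"] * 4 + ["Throckmorton"]

# A one-to-one substitution keeps the frequency profile intact, so the
# value that appears exactly once still stands out after masking.
one_to_one = {"Smith": "Adams", "Jones": "Baker", "Garcia": "Clark", "Throckmorton": "Davis"}
print(sorted(Counter(one_to_one[n] for n in column).values()))  # → [1, 4, 5, 6]

# Secure lookup with fewer outputs than distinct names merges counts:
# each masked value's count is the sum over the names that share it.
small_list = ["Adams", "Baker"]
masked_counts = Counter(secure_lookup(n, small_list, b"example-seed") for n in column)
print(sorted(masked_counts.values()))
```

Which names end up sharing a masked value depends on the hash and seed, so the rare name is not guaranteed to be hidden in every configuration. The smaller the list relative to the number of distinct names, the more likely it is to be merged with a common one.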
There is a trade-off here. The most secure secure lookup algorithm has only one entry; it maps every input value to the same output value. That configuration destroys all information, but a dataset masked that way probably isn’t very useful, since it is not representative of the structure of the production data. In general, it’s best to use the smallest secure lookup list that still satisfies the application that will be using the masked data.
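One rough way to reason about list size is to estimate how many distinct masked values a given list will produce. The sketch below uses the standard occupancy formula, assuming the hash spreads inputs uniformly and independently over the list; that is an idealization, and the function name `expected_distinct_outputs` is introduced here for illustration only.

```python
def expected_distinct_outputs(unique_inputs: int, list_size: int) -> float:
    """Expected number of distinct masked values.

    Assumption: the hash spreads inputs uniformly and independently
    over the output list.
    """
    return list_size * (1 - (1 - 1 / list_size) ** unique_inputs)


# Ten unique inputs, as in the example earlier in this section.
for size in (1, 5, 10, 50):
    print(size, round(expected_distinct_outputs(10, size), 2))
```

For 10 unique inputs, this estimate gives 1 distinct output for a single-entry list, about 4.5 for a list of 5, about 6.5 for a list of 10, and about 9.1 for a list of 50. The closer the figure gets to the number of unique inputs, the closer the masking is to one-to-one.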