Moving Masking to a Node-based Architecture
2017 was the year that we saw data masking take off, worldwide. Data masking solutions have existed for over a decade, but data masking initiatives (as part of data privacy initiatives) have stalled. This is for a variety of reasons, including:
- Data masking products or processes have been too hard, too custom and too slow
- The impact on testers and analysts is high if masking is done haphazardly or incorrectly
- There has been no overarching mandate to get it done
Delphix has focused on solving the first two, but it’s the shift in market perception and awareness that really made masking take off by addressing the third. With ever-present concerns about data breaches and data loss, and heavy-handed regulations like GDPR, effective data masking is now a must for organizations. As masking becomes widespread and part of standard development cycles, we’ve noticed several changes that impact how masking needs to be delivered:
- More data sets require masking: Masking isn’t optional - when copies of data are made, sensitive data need to be removed.
- Data volumes are larger: Our customers have shared with us that 50% year over year growth is still common.
- More content is masked: Inside of each database or data set, enterprises are masking more data - people are realizing that compliance requires masking all sorts of inference data as well - not just obvious choices like first name and last name.
- Masking happens more frequently: Developers and analysts require fresh data (what Delphix has always aimed to offer!), meaning that data need to be masked frequently and on an ongoing basis.
- Masking needs to happen in more places: Data are sprawled across databases, file systems, NoSQL databases, PaaS databases and SaaS services. Any time data copies are made, they need to be masked.
These five changes fundamentally alter how data masking happens - notably in the compute required and in data locality. While we continue to drive toward faster and more autonomous masking, there’s a massive increase in the amount of masking that needs to be done, and for performance and security reasons, masking needs to happen local to the data (which live in many more places).
Our product has traditionally been a monolithic software application. The amount of masking required, and the frequency with which it was occurring, was rarely high enough to justify more than one or two instances of a large virtual machine running Delphix masking. For all the reasons above, this has begun to change, and our customers’ plans now include running dozens or more instances of Delphix masking as needed (frequently ephemerally). This requires an architecture change.
Today our customers largely deploy a small number of very large instances of software and coordinate via the API.
Moving forward, we’re building towards stateless masking engine nodes - instances of the software that can be deployed for as long as they’re needed, wherever they’re needed. As an example, our customers expect to mask near the data - either in country as prescribed by data security laws, or in public / private clouds. The above diagram changes fairly dramatically to one where we have a stateful, ever-present management layer and many (think dozens or hundreds) ephemeral or long-lasting instances of the masking engine.
Central coordination and review of data masking / privacy rules, with remote, as-needed masking instances.
We’re taking the first major step towards this future in the 5.2 release with the release of a comprehensive masking API and the ability to remotely coordinate masking algorithms.
- Masking API: A masking API (similar to what we’ve used in virtualization for years) allows our masking product to be integrated into existing workflows (e.g. as part of provisioning data for dev / test) and provides the basis for our future state. In this release we’ve built APIs for everything around masking jobs, but expect this to expand to all operational aspects of the product.
- Algorithm Coordination: One of the major reasons our customers use Delphix Masking is that our algorithms are designed to consistently mask the same data the same way. As an example, “Jason” will always mask to the same output, every time. As applications typically span many tables and databases, this is crucial to having a usable data set after masking. As we make the transition covered above - many instances of software masking a large number of disparate data sets - this problem becomes even more challenging. In 5.2, we’re introducing support for coordinating algorithms across many instances of our software, ensuring masking consistency regardless of where the data live.
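To make the API point concrete, here is a minimal sketch of how a masking job might be scripted into a provisioning workflow. The endpoint path, field names, and auth header below are illustrative assumptions, not the actual Delphix masking API - the point is that a REST API lets masking be driven by automation rather than a UI.

```python
import json
from urllib import request

def build_masking_job(job_name, ruleset_id, on_the_fly=False):
    """Assemble the JSON body for creating a masking job.

    Field names here are hypothetical, for illustration only.
    """
    return {
        "jobName": job_name,
        "rulesetId": ruleset_id,
        "onTheFlyMasking": on_the_fly,
    }

def run_masking_job(base_url, auth_token, payload):
    """Build a POST request to a (hypothetical) masking-jobs endpoint."""
    return request.Request(
        f"{base_url}/masking-jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": auth_token,
                 "Content-Type": "application/json"},
        method="POST",
    )  # the caller would pass this to urllib.request.urlopen(...)
```

A dev / test provisioning pipeline could call something like this right after refreshing a data copy, so no unmasked copy is ever handed to a developer.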
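The consistency property described above - the same input always masking to the same output, on any node - can be sketched with a keyed hash. This is a hypothetical illustration, not the Delphix algorithm: any instance holding the same centrally distributed key produces the same replacement for the same input, which is what keeps masked data referentially consistent across tables, databases, and engine nodes.

```python
import hashlib
import hmac

# Illustrative lookup list; a real deployment would use a much larger one.
FIRST_NAMES = ["Alice", "Brandon", "Carmen", "Devin", "Elena", "Farid"]

def mask_first_name(value: str, key: bytes) -> str:
    """Deterministically map an input name to a replacement name.

    Same value + same key -> same output, on every instance.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)
    return FIRST_NAMES[index]

# Two "engine nodes" sharing the same coordinated key always agree:
node_a_key = node_b_key = b"shared-secret-from-coordinator"
assert mask_first_name("Jason", node_a_key) == mask_first_name("Jason", node_b_key)
```

The coordination problem then reduces to securely distributing the key (and the lookup data) to every instance, rather than synchronizing masked values themselves.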
This is a long journey to fundamentally redesign the way our software is used, but we think the data security problems are severe enough and the market is changing enough that it’s crucial for us to start this process now. We look forward to sharing more details in the future.
Finally - while this blog series just hit on some of the big items, I encourage you to take a look at the full notes of what was released in 5.2 and more importantly, get your hands on the software! The changes in virtualization and masking should be a lot of fun to play with.