Lock it Down or Air it Out?
Marriott International and Quora are the latest companies to show that data breaches remain a threat to organizations across industries. According to a 2018 IBM Security and Ponemon Institute study, an organization takes around 196 days on average to detect a data breach, and the average global cost of a breach is $3.9 million.
Sensitive data resides everywhere within enterprises, especially in non-production environments. Non-production environments for development, testing and reporting represent up to 90 percent of the attack surface for breaches and are often less scrutinized from a security perspective.
Businesses are often forced to choose between locking down data for security purposes and making that data easily accessible to the people who consume it. Locking down all access to data creates a lengthy and tedious process of managing access, which is not a practical or scalable security strategy. Restricting access also limits the availability of crucial test data, which causes other problems in the development cycle.
Here are 6 ways to help safeguard test data while ensuring it remains valuable for software development teams to use when and where they need it.
1. Know and communicate masking requirements
Make sure you have a sensible plan upfront that defines the scope of the effort, drives the requirements and identifies the resources you will need to execute. A data masking program demands good requirements, which means you have to know what you are masking, how you are going to mask it, and what you are not going to mask.
One of the worst situations for a masking project occurs when QA departments set unnecessarily complicated requirements for the masked data. The time it takes to develop the rules and to actually mask the data depends greatly on the requirements, so there should be a valid QA reason for each one. For example, most QA testing does not require gigantic audit logs or transaction history tables, so those tables can be truncated to a fraction of their production size with no impact on testing.
Once business analysts and data architects define the data needs and identify the data sources, only then can data masking developers begin to implement a repeatable process. It takes a coordinated cross-functional team to deliver quality masked data to downstream consumers.
Knowing the volume and dependencies is critical to ensuring the infrastructure can run the data masking processes reliably. Without proper alignment or dedicated resources, masking projects are very difficult to complete successfully and can cause significant project delays.
2. Be familiar with the dependencies
Know the volume of expected data, growth rates and the time it will take to load the data. If the masking processes are expected to run during a three-hour window, be certain that all processes can complete in that timeframe, now and in the future.
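The window check above is simple arithmetic, and it can help to write it down. The following is a minimal sketch that assumes steady compound growth and a constant masking throughput; the function names and the sample figures are illustrative, not measurements from any real environment.

```python
import math


def projected_runtime_hours(current_gb, throughput_gb_per_hour,
                            monthly_growth_rate, months_ahead):
    """Estimate masking runtime after compound growth in data volume."""
    future_gb = current_gb * (1 + monthly_growth_rate) ** months_ahead
    return future_gb / throughput_gb_per_hour


def months_window_holds(current_gb, throughput_gb_per_hour,
                        monthly_growth_rate, window_hours):
    """Return how many whole months of growth still fit in the window.

    Returns 0 if the job already overruns the window and None if the
    data is not growing (the window then holds indefinitely).
    """
    capacity_gb = throughput_gb_per_hour * window_hours
    if current_gb > capacity_gb:
        return 0
    if monthly_growth_rate <= 0:
        return None
    return math.floor(
        math.log(capacity_gb / current_gb) / math.log(1 + monthly_growth_rate)
    )
```

With the assumed figures of 400 GB today, 200 GB per hour of throughput and 3 percent monthly growth, `months_window_holds(400, 200, 0.03, 3)` returns 13, meaning the three-hour window would be exceeded a little over a year from now.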
Many companies will spend years in a state of “analysis paralysis,” trying to figure out how to mask everything rather than getting 80 to 90 percent of their sensitive data masked within a few months. The best approach is to reduce the risk immediately by masking the easy fields first, and then tackling the harder ones in the next phase.
For instance, if your entire database environment is old and uses a medical record number (MRN) as a primary key field, masking the key field is going to greatly increase project complexity. For that reason, mask everything else first and get the MRN later.
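To see why a key field is harder than an ordinary column, consider that every table referencing the key must receive exactly the same masked value, or joins break. The sketch below illustrates that constraint with a keyed hash (HMAC-SHA256), which is one common deterministic technique and not necessarily what any particular masking product does. The table layout, column name and secret are hypothetical.

```python
import hashlib
import hmac


def mask_key(value, secret):
    """Deterministically replace a key value using a keyed hash."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter value raises the collision risk.
    return digest[:12]


def mask_key_everywhere(tables, key_column, secret):
    """Apply the same key mapping to every table that carries the key column."""
    return {
        name: [
            {**row, key_column: mask_key(row[key_column], secret)}
            for row in rows
        ]
        for name, rows in tables.items()
    }
```

Because the mapping depends only on the input value and the secret, a patient row and its visit rows still join after masking. The cost is that every table, extract and downstream copy holding the key has to be found and masked in the same run, which is where the extra project complexity comes from.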
It’s also customary to load data in parallel, when possible. Even medium-sized test environments will have gigabytes of data loaded on a frequent basis, yet the data models will have dependencies, such as dimensions that must be loaded first. Jobs providing that data will need to be completely loaded before dependent jobs can begin.
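One way to combine parallel loading with load-order dependencies is to run every job whose prerequisites have finished and hold back the rest. The following is a minimal sketch using Python's standard `graphlib` and `concurrent.futures` modules (Python 3.9 or later); the job names and the `run_job` callable are placeholders for whatever actually performs a load.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_in_dependency_order(dependencies, run_job, max_workers=4):
    """Run jobs in parallel while respecting load-order dependencies.

    `dependencies` maps each job name to the set of jobs that must
    finish before it can start. Returns jobs in completion order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    running = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every job whose prerequisites are all complete.
            for job in sorter.get_ready():
                running[pool.submit(run_job, job)] = job
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                job = running.pop(future)
                future.result()  # re-raise any failure from the load
                sorter.done(job)  # unblocks jobs that depend on this one
                finished.append(job)
    return finished
```

For example, with `{"fact_claims": {"dim_patient", "dim_provider"}, "dim_patient": set(), "dim_provider": set()}`, the two dimension loads run side by side and the fact load starts only after both have completed.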
If necessary, consider archiving incoming files if those files cannot be reliably reproduced as point-in-time extracts from their source system, or if they are provided by outside parties and may not be available on a timely basis. Terabytes of storage are inexpensive, both onsite and off, but a retention policy will need to be built into jobs, or separate jobs will need to be created to manage archives.
3. Map the source to staging areas and targets
What is the source of the data? Has it been approved by the data governance group? Does the data conform to the organization's master data management (MDM) and represent the authoritative source of truth? In organizations without governance and master data management, data cleansing becomes a noticeable effort in the data masking development.
It’s not unusual to have dozens or hundreds of disparate data sources. The sources range from text files to direct database connections to machine-generated screen-scraping output. There are data types to consider, security permissions to consider and naming conventions to implement.
The data masking jobs must be managed in much the same way as source code changes are tracked. If a source or a masking rule changes, that change needs to be documented, because a single rule change can have major implications for the masking run and may require you to re-mask your entire environment.
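One lightweight way to notice that a rule has changed is to fingerprint the rule set and compare it with the fingerprint recorded at the last masking run. The sketch below assumes the rules can be exported as a plain dictionary; the rule structure and function names are hypothetical and do not reflect any specific tool's format.

```python
import hashlib
import json


def ruleset_fingerprint(rules):
    """Return a stable SHA-256 fingerprint of a masking rule set."""
    # Sorting keys makes the fingerprint independent of dictionary order.
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_remask(rules, last_applied_fingerprint):
    """True when the rules differ from the ones used in the last run."""
    return ruleset_fingerprint(rules) != last_applied_fingerprint
```

Storing the fingerprint alongside the rule file in version control gives a documented trail of when the rules changed and which masked environments predate the change.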
4. Log everything
Whether working with dozens or hundreds of feeds, capturing the count of incoming rows and the resulting count of rows to a landing zone or staging database is crucial to ensuring the expected data is being loaded.
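A row-count check of this kind can be as small as comparing two dictionaries. The following is a minimal sketch, assuming the incoming and staged counts have already been collected per feed; the feed names in the usage note are made up.

```python
def find_count_mismatches(source_counts, staged_counts):
    """Compare incoming row counts with staged row counts per feed.

    Returns a dict mapping each problem feed to an
    (incoming, staged) tuple. A count of None means the feed was
    missing on that side.
    """
    mismatches = {}
    for feed in sorted(set(source_counts) | set(staged_counts)):
        incoming = source_counts.get(feed)
        staged = staged_counts.get(feed)
        if incoming != staged:
            mismatches[feed] = (incoming, staged)
    return mismatches
```

For instance, `find_count_mismatches({"orders": 1000, "customers": 250}, {"orders": 998})` reports that `orders` lost two rows and that `customers` never reached staging.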
Unexpected things can also happen in the midst of a data masking process. When dozens or hundreds of data sources are involved, there must be a way to determine the state of the process at the time of the fault. That’s why logging is crucial in determining where in the flow a process stopped. Can the data be rolled back? Can the process be manually restarted from any one of the masking jobs, or from several of them?
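The restart question becomes easier to answer if each completed job is recorded as it finishes. The sketch below shows one possible approach: log every step and keep a small checkpoint file so that a rerun skips the jobs that already succeeded. The checkpoint format, job names and `run_job` callable are assumptions for illustration, and the sketch does not address rolling back partially masked data.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("masking_run")


def run_with_checkpoint(jobs, run_job, checkpoint_path):
    """Run jobs in order, recording progress so a rerun resumes after a fault."""
    path = Path(checkpoint_path)
    completed = json.loads(path.read_text()) if path.exists() else []
    for job in jobs:
        if job in completed:
            logger.info("skipping %s (already completed)", job)
            continue
        logger.info("starting %s", job)
        try:
            run_job(job)
        except Exception:
            # The log shows exactly where the flow stopped.
            logger.exception("fault in %s; completed so far: %s", job, completed)
            raise
        completed.append(job)
        path.write_text(json.dumps(completed))
        logger.info("finished %s", job)
    # The full run succeeded, so the next run should start clean.
    path.unlink(missing_ok=True)
    return completed
```

After a fault, calling the function again with the same job list and checkpoint path picks up at the job that failed.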
Data masking tools have their own logging mechanisms. Enterprise scheduling systems have yet another set of tables for logging. Each serves a specific logging function, and in most environments it is not possible to substitute one for another. One solution is a reporting system that draws upon the logging tables of all the related systems. For example, the Delphix masking APIs can be used to pull job results to be stored in whatever central system is in use.
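As a rough illustration of such a reporting layer, the sketch below merges job results from two systems into one SQLite table and lists every job that either system did not report as successful. The record fields, status values and table name are hypothetical; how the results are actually retrieved from a masking tool or scheduler is outside the scope of this sketch.

```python
import sqlite3


def report_problem_jobs(masking_results, scheduler_results):
    """Merge results from two logging sources and list non-successful jobs.

    Each result is a dict with "job" and "status" keys; masking results
    may also carry "rows_masked". Returns (job, source, status) tuples.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE job_log "
            "(source TEXT, job TEXT, status TEXT, rows_masked INTEGER)"
        )
        conn.executemany(
            "INSERT INTO job_log VALUES ('masking', ?, ?, ?)",
            [(r["job"], r["status"], r.get("rows_masked")) for r in masking_results],
        )
        conn.executemany(
            "INSERT INTO job_log VALUES ('scheduler', ?, ?, NULL)",
            [(r["job"], r["status"]) for r in scheduler_results],
        )
        return conn.execute(
            "SELECT job, source, status FROM job_log "
            "WHERE status <> 'SUCCEEDED' ORDER BY job, source"
        ).fetchall()
    finally:
        conn.close()
```

In practice the table would live in whatever central database is already in use, so that a single query answers what each system recorded for a given run.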
5. Alert only when a fault occurs
Alerts are often sent to technical managers, noting that a process has concluded successfully. With many processes, these types of alerts become noise. Alerting only when a fault has occurred is more suitable. You can also send an aggregated alert with the status of multiple processes in a single message. While there is less noise, these kinds of alerts are still not as effective as fault alerts.
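The two alerting styles described above can be contrasted in a few lines. This is a minimal sketch that assumes each job reports a simple status string, with `"SUCCEEDED"` as the only non-fault value; the message wording is illustrative, and actually delivering the messages (email, chat, paging) is left out.

```python
def build_fault_alerts(job_statuses):
    """Return one alert per faulted job, or an empty list if all succeeded."""
    return [
        f"FAULT: {job} ended with status {status}"
        for job, status in sorted(job_statuses.items())
        if status != "SUCCEEDED"
    ]


def build_digest(job_statuses):
    """Return a single aggregated status line covering every job."""
    faults = [j for j, s in sorted(job_statuses.items()) if s != "SUCCEEDED"]
    summary = f"{len(job_statuses) - len(faults)} of {len(job_statuses)} jobs succeeded"
    if faults:
        summary += "; faults: " + ", ".join(faults)
    return summary
```

When every job succeeds, `build_fault_alerts` returns nothing to send, which is the point: silence means success, and any message that does arrive demands attention.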
6. Document, document, document
Beyond the source-to-target mapping documents, the non-functional requirements and inventory of jobs will need to be documented as text documents, spreadsheets, and workflows. This will provide ongoing continuity for the program to make it easier to maintain as it evolves and grows.
Compared with other approaches, such as synthetic data, homegrown scripts or encryption, masking goes a long way toward satisfying both the security and the data usability needs of an enterprise. That's the whole premise behind data masking: keeping the data secure while ensuring that it's transformed in a way that retains its business value. Conversely, if you over-mask or mask in a way that leaves the data unrealistic, your masked data is no longer valuable to your testers and developers.
Make data fast and secure for access across the company. Register for a 30-minute demo to discover how the Delphix Dynamic Data Platform provides an integrated solution for identifying, masking, and delivering secure data. You can also download our Data Privacy & Security solution brief to learn how Delphix ensures data is properly governed and secured, while also empowering teams to use that data when and where they need it.