You Can’t Protect What You Can’t See
Data is growing not only in size, with IDC predicting 163 zettabytes of data by 2025, but also in complexity. The rise of open source software, DevOps, and NoSQL databases has created a new generation of developers empowered to choose the right tool for the job across an ever more heterogeneous technology stack. Companies are not just migrating to the cloud, but are embracing multiple public and private clouds, choosing carefully where their applications run in order to make the best use of each cloud's services.
This landscape creates new challenges for data security. Successful security relies on a thorough understanding of our systems, and a clear notion of potential attack vectors from adversaries both known and unknown. As Donald Rumsfeld, then US Secretary of Defense, put it during a press conference in the months leading up to the US invasion of Iraq:
“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.”
Sensitive data is increasingly an “unknown unknown”. As new data is created continuously, and the structure and form of data constantly evolves, it becomes difficult to even know where to look, let alone how to fully understand and manage the risk that data creates throughout a company.
It’s like looking at the tip of an iceberg: while we may be confident in our ability to protect what we can see from unknown assailants, there’s a massive amount of invisible risk lurking beneath the surface. For example, even when you do manage to secure critical production data, that data is often copied and moved to other environments for development, testing, analytics, and more.
Traditionally, this has been a problem of centralization. Using classic techniques - such as standardizing on a single data platform, or forcing all changes through a single architectural and security review process - is increasingly difficult in today’s world. As data spans more places and formats, it gets fragmented. And so do the teams of people who consume and manage it.
This fragmentation is critical for enabling speed and agility in the business, but it creates significant data friction when it comes to identifying and understanding data risk within the business.
Identifying and tracking assets is a well-understood practice, but data friction demands a new approach, one that can complement human systems by leveraging software to identify and qualify risk in data across the enterprise. The basic elements are straightforward:
- Establish a platform that can connect to your critical systems
- Use data profiling techniques to identify sensitive information such as names, addresses, and social security numbers
- Rigorously mitigate risk identified by these processes
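To make the profiling step concrete, here is a minimal sketch in Python. It assumes the data has already been pulled from a connected system as a list of row dictionaries, and it uses two illustrative regular expressions (US Social Security number format and email address). The function name `profile_rows`, the pattern labels, and the 50% threshold are all hypothetical choices for this example, not part of any particular product. Names and postal addresses are much harder to detect with patterns alone and are left out here.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumption): real profilers use far richer
# detection, including checksums, dictionaries, and context.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def profile_rows(rows, threshold=0.5):
    """Flag columns where a large share of non-empty values look sensitive.

    Returns {column: {label: fraction_of_non_empty_values_matching}} for
    every column/label pair whose fraction is at least `threshold`.
    """
    totals = defaultdict(int)                      # non-empty values per column
    hits = defaultdict(lambda: defaultdict(int))   # matches per column and label

    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                continue
            totals[column] += 1
            text = str(value)
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    hits[column][label] += 1

    findings = {}
    for column, labels in hits.items():
        flagged = {
            label: count / totals[column]
            for label, count in labels.items()
            if count / totals[column] >= threshold
        }
        if flagged:
            findings[column] = flagged
    return findings


# Hypothetical sample rows standing in for a table read from a real system.
sample = [
    {"id": 1, "contact": "ana@example.com", "tax_id": "123-45-6789"},
    {"id": 2, "contact": "bo@example.org", "tax_id": "987-65-4321"},
    {"id": 3, "contact": "n/a", "tax_id": ""},
]

findings = profile_rows(sample)
for column, labels in findings.items():
    print(column, labels)
```

Scoring by the fraction of matching values, rather than flagging on a single hit, is one way to reduce false positives from stray values in otherwise harmless columns. The right threshold depends on the data and on how much risk you are willing to miss.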
But even this approach has limits. It requires that you know how to connect to the relevant information, and it requires someone to know which types of information present the highest security risk.
It sounds like a job for machine learning and artificial intelligence, and some cloud service providers already offer it.
Amazon Macie is an AWS service that uses machine learning to automatically discover, classify and protect sensitive data, such as PII and intellectual property, stored in Amazon S3. It also offers a dashboard for tracking how the data is accessed, used and moved.
Google Cloud provides a similar API for applications to use, known as the Data Loss Prevention API. Not only does it provide the ability to search text for sensitive information, but it can also process images and redact information automatically.
But they’re just early examples of how data security practices must evolve. We must move beyond just unstructured data in a single cloud, towards structured and unstructured data across increasingly heterogeneous environments.
And as we’ll explore later in this series, simply identifying data is not enough. We need to understand how systems connect to data workflows in the enterprise. Are they production sources? Non-production sources? How are copies of that data made? Where are they going and where have they been?
DataOps practices help solve these problems by bringing together the people and technology responsible for creating and changing data sources, and understanding how that data flows throughout the enterprise. By leveraging software to complement process, you can free up human capital to invest in building the right collaboration and visibility into the data pipeline.
Techniques such as machine learning will shine a light on things that no human can detect. And as we’ll see later in this series, addressing this from a systemic, data-centric point of view is the only way to adequately identify risk across a complex enterprise.