Your testing is only as good as the data you feed it.
Aug 07, 2017
The means by which the Power of DataOps helps connect people and data varies widely by technology, users, and use case. Data Democratization, for example, has been the topic du jour in analytics circles for some time, and focuses on providing non-technical users the data they need to answer critical questions and drive new insights for the business. While business intelligence is about helping a specific set of non-technical users analyze data they already have access to, DataOps is all about reducing the data friction associated with getting inaccessible data to everyone — technical and non-technical consumers alike. Every company is a software company now, and the legions of developers and testers responsible for building and delivering that software are finding data friction starving them of the data they need to deliver innovation to the business.
Application development creates massive friction when it comes to data delivery:
Data must be available in its native form. An application written to consume PostgreSQL data cannot simply connect to a Hadoop data lake in a development environment.
Data must be fully read-write for the consumers of that data, making it difficult or impossible to use read-only or shared environments.
Data state drives application behavior, requiring the ability to start from known state, and share problematic state with testers and other developers.
Data must be realistic to drive higher quality and efficiency, with production data being the best option for most development and testing environments.
Each of these has a plethora of challenges and solutions, but in this article I want to focus on just one: the importance of realistic production-like data throughout the development lifecycle.
Figure source: IBM System Sciences Institute
Defects are more expensive when found later in the development lifecycle. While there is debate over the exact nature of such a metric, the intuitive conclusion is generally accepted: late-stage defects are found in more complex test environments, requiring more effort to root cause, and more tests to be re-run once fixed. This escalating effort is the foundation for the “shift left” approach to testing, which has been thoroughly embraced by the DevOps movement. In this context, “left” refers to moving activities earlier along a timeline, such that defects are found more quickly and at lower cost.
Data state drives application behavior, so your testing is only as good as the data you feed it. And there are times when you absolutely need to use synthetic data:
You are building a new product, for which real production data does not yet exist.
You are writing a unit test, which needs to have predictable behavior and hence a programmatically defined starting state.
You are writing functional tests where a full dataset might make queries too expensive to be useful in early stage testing.
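For the unit-test case, a programmatically defined starting state might look like the following minimal Python sketch. The `accounts` table, its rows, and the `total_balance` function are hypothetical names invented for illustration, and SQLite is used only because it ships with Python. The point is simply that each test builds its own small, known dataset.

```python
import sqlite3
import unittest


def total_balance(conn):
    """Hypothetical function under test: sum all account balances."""
    row = conn.execute("SELECT COALESCE(SUM(balance), 0) FROM accounts").fetchone()
    return row[0]


class TotalBalanceTest(unittest.TestCase):
    def setUp(self):
        # Build a fresh in-memory database so every test starts from the same known state.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)"
        )
        self.conn.executemany(
            "INSERT INTO accounts (id, balance) VALUES (?, ?)",
            [(1, 100), (2, 250), (3, -50)],
        )

    def tearDown(self):
        self.conn.close()

    def test_total_balance(self):
        # Three synthetic rows: 100 + 250 - 50
        self.assertEqual(total_balance(self.conn), 300)
```

Because the data is created inside `setUp`, the test behaves the same on every run and on every machine, which is exactly what makes synthetic data the right choice at this level.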
For the majority of complex testing, however, fresh, secure production data is the best answer. This includes manual developer testing, system testing, scalability testing, UAT, and more. Using stale, synthetic, or subsetted data may seem like a reasonable compromise, but at Delphix we’ve seen firsthand the fallacy of this thinking:
An early customer angrily called our support line to tell us that our system was too slow: tests were taking way longer with Delphix than on their physical systems. After poring over database logs and system metrics, we determined that everything looked good on our end. A few days later, the customer came back to sheepishly explain that they had a bug in their software that was triggering full table scans due to a missing index. Prior to Delphix, they had only used subsetted data (all the way through UAT), so they would never have seen this problem before it hit production and caused a major outage.
A prospect we were talking to recalled a major production outage for an online education solution. A new feature had been developed and thoroughly tested, but when deployed to production it caused an outage that prevented all 100,000 students and teachers using the software from submitting or grading work for more than two days. The root cause? The feature had been developed using test data with only a certain class of course codes (e.g. “BME” for “Biomedical Engineering”). In the real world, course codes had different variations of letters and styles, which ended up causing rampant data corruption that had to be manually repaired in production.
At a Delphix financial services user group, a customer presented what is still perhaps the best example of “shift left” that I have seen in practice: before and after graphs that show the number of defects found in each feature release, broken down by development stage. The picture speaks for itself:
Number of defects found in each feature release, before and after deploying Delphix
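The missing-index story in the first anecdote above is easy to reproduce in miniature. The sketch below is an illustration only: it uses SQLite as a stand-in (the customer's actual database is not named here), and the `orders` table, `idx_orders_customer` index, and `query_plan` helper are hypothetical. It asks the database how it intends to run a lookup before and after the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return "; ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i) for i in range(10_000)],
)

lookup = "SELECT total FROM orders WHERE customer_id = ?"

# No index on customer_id yet: the plan reports a scan of the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

# With the index in place, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, lookup, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

On a subset of a few thousand rows, both plans return almost instantly, so a timing-based test would pass either way. The cost of the scan only becomes visible when the table is production-sized, which is why the bug in the anecdote stayed hidden until full data was used.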
These are but a fraction of the anecdotes and proof points we’ve seen at Delphix, all of which point to one clear conclusion:
Full, secure, personal production datasets are essential to driving velocity, efficiency, and quality throughout the development lifecycle.
While these stories were told through the lens of Delphix, the same is true of any DataOps solution that can efficiently deliver full production data to developers with self-service control over the data within a personal environment.
Velocity of innovation is king, and providing everyone access to realistic data throughout the software development lifecycle is a competitive edge to winning in the software economy.
DataOps is the means to get there. It’s time to start the journey.