DataOps: What You Stand to Lose — Part 2

By focusing on the DataOps areas of version control and transformation, companies can tackle the friction that manifests as data defects.

In part two of this series, I want to look at the impact of DataOps on data defects. Read part one here.

Decisions and actions based on incorrect information carry far more potential to be destructive than data-delayed decisions and actions. Scott Emigh, CTO of Microsoft's US Partner organization, recently said, "Your analytics are only as good as the quality of data on which you reason." I extrapolate this same logic to Machine Learning (ML), where technology is "doing the reasoning" for us. With the recent proliferation of these technologies, exponentially more decisions are being made every second of every day, on both good and defect-laden data. This makes data quality even more imperative. Would you rather make one decision a day with a 10% data defect rate, or make that decision one million times a day with a 0.1% defect rate? Likewise, this extends into the DevTest space; all of the automated testing in the world means nothing if the quality of the underpinning data makes the tests unreliable or generates false positives. Scott Prugh, VP of Development at CSG International, observed that automated tests run on insufficiently sized data sets (several hundred MB versus several hundred GB) were a major contributor to production failures.

Scott Emigh's headshot
Scott Emigh, CTO of Microsoft's US Partner organization
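To make the volume point concrete, here is the expected-defect arithmetic behind that one-versus-one-million comparison (a back-of-the-envelope sketch using the rates quoted above):

```python
# Expected number of bad decisions per day under each scenario,
# assuming defects occur independently at the stated rates.
low_volume = 1 * 0.10             # one decision/day, 10% defect rate
high_volume = 1_000_000 * 0.001   # one million decisions/day, 0.1% defect rate

print(low_volume)   # 0.1 expected bad decision per day
print(high_volume)  # 1000.0 expected bad decisions per day
```

Even with a defect rate one hundred times lower, the high-volume scenario produces ten thousand times more bad decisions per day, which is exactly why quality matters more as automated decision-making scales up.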

By focusing on the DataOps areas of version control and transformation, companies can tackle the friction that manifests as data defects. First, where applicable, bringing data under version control allows data operators and consumers to start their work with data originating from a discrete point in an immutable data repository, which greatly improves the integrity of, and trust in, the data from the beginning. Most data needs to undergo some sort of transformation before it is ready for the data consumer. This transformation can take many different forms, depending on the requirements: subsetting, synthesis, ETL, endian conversion, relational-to-NoSQL conversion, and so on. Conducting these activities inadequately can result in myriad data defects: duplicate data, missing data, datatype mismatches, missing corner cases, and improper data sequencing, for example.

Scott Prugh's Headshot
Scott Prugh, VP of Development at CSG International

The way to combat these errors is to repeatedly refine and automate these activities, subjecting the transformations to their own quality tests and then committing the transformed data back into version control. I have been to many companies across the globe that have armies of people devoted specifically to data transformation activities. There are two common problem areas I have seen among these customers.
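As a sketch of what such a quality gate might look like (the function and checks here are my illustration, not any particular tool's API), a transformed data set can be scanned for common defect classes like duplicates and missing values before it is committed back into version control:

```python
def quality_checks(rows, required_fields):
    """Return a list of defect descriptions found in a transformed data set."""
    defects = []
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:                      # duplicate data
            defects.append(f"row {i}: duplicate")
        seen.add(key)
        for field in required_fields:
            if row.get(field) is None:       # missing data
                defects.append(f"row {i}: missing {field}")
    return defects

# Hypothetical transformed output with two of the defect classes above.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate row
    {"id": 2, "amount": None},   # missing value
]
print(quality_checks(rows, ["id", "amount"]))
```

Only when the defect list comes back empty would the pipeline commit the transformed data set; otherwise the transformation itself gets fixed and rerun.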

First, without a DataOps-centric approach, they're still experiencing constraints around getting data to the transformation teams to work with. The transformation teams often have to work with old data sets that they repurpose over and over again, which injects serious quality issues. That's right: in this instance the Data Operators are also Data Consumers, and victims of "siloization." Adopting the self-service and automation I spoke about in the delays post allows the data transformation groups to work with fresh, full data sets as needed, increasing the quality of their activities.

The second problem area is the constantly changing nature of transformations. Data consumers are not always aware that the data they are leveraging today is different from yesterday's. How many times have we all asked the question "what changed?" only to hear the same answer: "nothing"? Committing the transformed data sets to version control allows data consumers to have high confidence in their activities and products.
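One lightweight way to give consumers that confidence is to fingerprint each committed data set, so "what changed?" becomes a comparison of version hashes rather than a matter of memory. This is a hypothetical sketch, not a specific product's mechanism:

```python
import hashlib
import json

def dataset_version(rows):
    """Deterministic content hash of a data set: same content, same version."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "amount": 10.0}])
v2 = dataset_version([{"id": 1, "amount": 12.5}])
print(v1 != v2)  # True: the transformation output changed, and the version shows it
```

If yesterday's and today's data sets carry different version hashes, "nothing changed" is no longer a plausible answer.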

Tackling both of these areas simultaneously allows you to truly address the constraint, rather than just shifting the pain. Adopting this approach has allowed companies to increase the number of defects found in development while dramatically decreasing the total number of defects overall. This has allowed them to achieve the fast-feedback, higher-quality dev/test loops promised by shift-left methodologies. A deeper explanation of this key performance indicator is covered by my friend and boss, Eric Schrock, here.