The Business Value of DevOps and Data

Gene Kim, an award-winning CTO and WSJ bestselling author of “The Unicorn Project,” presents at the 2021 Data Company Conference.

Gene Kim: Hello, everyone here at the Data Company Conference. My name is Gene Kim, and I've had the privilege of studying high-performing technology organizations for 23 years. That was a journey I started when I was the CTO and founder of a company called Tripwire in the information security and compliance space. Our goal was to study these amazing organizations that simultaneously had the best project due date performance in development, the best operational reliability and stability, and the best posture in security and compliance.

We wanted to understand how those amazing organizations make their good to great transformation, so we could understand how other organizations could replicate those amazing outcomes. As you can imagine in a 23 year journey, there were many surprises, but by far the biggest surprise to me was how it took me into the middle of the DevOps movement, which I think is so urgent and important. The last time that any industry has been disrupted to the extent that all our industries are being disrupted today was likely in manufacturing in the 1980s, when it was revolutionized through the application of Lean principles. I think that's exactly what DevOps is.

You take those same Lean principles, apply them to the technology value stream, where we work every day, and you end up with these emergent patterns that allow organizations to do tens, hundreds, even hundreds of thousands of deployments per day, while preserving world-class reliability, security, and stability.

And so why do we care about that? It's so that we can win in the marketplace. In the next 45 minutes, what I'd like to do is share with you some of my top lessons learned, especially as every company is turning into a data company. I'd like to start by sharing what our definition of DevOps is. In 2016, this is the definition that we put into “The DevOps Handbook.” It is specifically the architectural practices, technical practices, and cultural norms that enable us to increase our ability to deliver applications and services quickly and safely.

Those are the conditions that enable us to rapidly experiment and innovate, just like Jed was talking about earlier today, as well as deliver value in the fastest possible way to our customers. We do this in a way where we can ensure world-class security, reliability, and stability and so we can win in the marketplace.

This doesn't actually describe what DevOps is. Instead, it describes the outcomes that we aspire to. But as much as I like this definition, there's a definition that I like even more, and this doesn't come from me. It comes from my friend Jon Smart, who led the ways of working at Barclays, a bank founded in the year 1635, which actually predates the invention of paper cash. His definition is simply this: it is better value, sooner, safer, and happier.

There are two reasons why I love this definition. One is it's shorter than the definition I gave you and yet is entirely as accurate. Two is it's so difficult to argue against. I think not even your biggest DevOps skeptic would say that they want less value later with more danger and more misery. I think this is a phenomenal way to articulate what we're trying to do within the DevOps community.

So Jed mentioned something that really resonated with me, and he said that the best is yet to come. The tech giants, whether it's the Facebooks, Amazons, Netflixes, Googles, or Microsofts, have generated trillions of dollars of market cap value, but as much value as they have created, it will be dwarfed by the economic value of what comes next. In other words, what happens when each of the more than 18 million developers employed by large, complex organizations can be as productive as if they were working at a Facebook, Amazon, Netflix, or Google? There is no doubt in my mind that when that occurs, it will generate trillions of dollars of economic value per year.

I had made this claim that DevOps is urgent and important, so much so that the absence of something like DevOps leads to horrendous outcomes, regardless of what organization we're in, regardless of what industry we compete in, regardless of how long our organization has been around, or even whether it's for-profit or not-for-profit. One of the best verbalizations of why this occurs came from Ward Cunningham almost 20 years ago. He called it technical debt. He described technical debt as the feeling we have the next time that we want to make a change. In my mind, this is almost poetic.

It evokes this image for me. It is the accumulation of all the garbage we put into our data centers, each time made with a promise that we're going to fix it when we have a little bit more time. And just the way that human nature works, the way that life in general works, is that there's never enough time. Technical debt is like financial debt in that, left unaddressed, it gets worse. This is bad, but it's not as bad as what it becomes, which may look like this. This creates horrendous outcomes, regardless of where in the value stream we reside, whether it is infrastructure and operations, whether we're a developer, whether we are a product owner or within the businesses we serve, whether it's information security, or whether we are a DBA or a data engineer.

I had mentioned that “The Phoenix Project” came out in 2013 and that there were some very large problems that I wanted to explore further, problems that still remain. One is the absence of understanding of all the invisible structures required to truly unleash developer productivity, and in an age where we must compete on innovation, nothing matters more than this. There's this other problem around data. It is orthogonal to the problem that DevOps set out to solve. The DevOps movement rightly pointed out that it took too much effort and time to get code to where it needed to go, which is in production, so that customers are gaining value and say, "Thank you."

This orthogonal problem is around data that is often trapped in systems of record or data warehouses, and it, too, might take weeks, months, or quarters to get it to where it needs to go, which is in the hands of people who make decisions or in the hands of developers, to affect the products that we support.

This could be any one of the 30 to 50% of employees who use or manipulate data in their daily work. So arguably this is an even larger problem than what DevOps set out to solve. There's often very strong opposition to supporting these newer ways of working, and there's ambiguity in terms of what behaviors we need from our leaders to support this type of transformation. In “The Phoenix Project,” we had the three ways and the four types of work, and in “The Unicorn Project,” we have the five ideals. I'm going to describe what each one of these ideals looks like, both in the ideal and not ideal, and some ideas on how to get from here to there.

But before that, I want to substantiate a claim that DevOps creates business value. This is based on the State of DevOps research that I got to work on with Dr. Nicole Forsgren and Jez Humble. It's what went into the “Accelerate” book in 2018. It is a cross-population study where we surveyed over 36,000 respondents over six years with the goal of understanding what high performance looks like, and what are those behaviors that lead to high performance? What we found six years in a row is that high performers exist, and they massively outperformed their non high-performing peers.

So as measured by what? We know that they're deploying more frequently. High performers are deploying multiple times a day. That's two orders of magnitude more frequently than their peers. More importantly, when they do deployments, they can finish them in one hour or less. In other words, how quickly can they go from a change being put into version control, through integration, through testing, through deployment, so that a customer is actually getting value? They can do it in one hour or less, two orders of magnitude faster than their peers.

When they do a deployment, they're seven times more likely to have it succeed without causing an outage, a service impairment, a security breach, or a compliance failure. When bad things happen, which Murphy's law guarantees, they can fix it in one hour or less. That's three orders of magnitude faster than their peers. That's measured by the mean time to restore service.
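To make these measures concrete, here is a minimal sketch of how deployment frequency, change failure rate, and mean time to restore could be computed from a list of deployment records. The `Deployment` record and its field names are assumptions made for illustration; they are not taken from the State of DevOps research or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment record; field names are illustrative."""
    deployed_at: datetime
    failed: bool = False                      # caused an outage or impairment
    restored_at: Optional[datetime] = None    # set only for failed deployments


def deploys_per_day(deployments: List[Deployment], days: int) -> float:
    """Deployment frequency over an observation window of `days` days."""
    return len(deployments) / days


def change_failure_rate(deployments: List[Deployment]) -> float:
    """Fraction of deployments that caused a failure."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_restore(deployments: List[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restored service."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deployments
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Tracked over time, the same three functions show whether smaller, more frequent deployments are actually lowering the failure rate and the time to restore.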

What we've found for six years in a row is that the only way that you can get these types of reliability profiles is to do smaller deployments more frequently. Over the years, we looked at other dimensions of quality. We know that because high performers are integrating information security objectives into everyone's daily work, they spend only half the amount of time remediating security issues. Because they're doing a better job in controlling unplanned work, they can devote nearly a third more time to planned work. These are the more strategic activities, as opposed to just merely value-preserving activities around firefighting.

A couple of years after that, we started looking at organizational performance. We found that these high performers, not only did they have better technical measures, but they were twice as likely to exceed profitability, market share, and productivity goals. For not-for-profits, we found the same multiple of performance. They were twice as likely to achieve organizational and mission goals, regardless of how they measured it, whether it was customer satisfaction, quantity, or quality.

What this says is that if achieving the mission requires work that we do in the technology value stream every day, DevOps helps with the achievement of those objectives. We found other organizational markers of performance. We found that in high performers, employees were twice as likely to recommend their organizations as a great place to work to their colleagues and friends, as measured by the Employee Net Promoter Score.

Beyond the numbers, what does it really suggest to me? It says what is the opposite of technical debt? And to me, it says to what extent we can safely, quickly, reliably, securely achieve all the goals, dreams, and aspirations of the organizations that we serve. I had mentioned the five ideals. So let's jump into the first one. 

The first ideal is all about locality and simplicity. So much of this was informed by me understanding the true profundity of the birth and death of a technology called Sprouter at Etsy. Etsy is the e-commerce giant that went public a couple of years ago, and this is a story about how teams of engineers work together to create value for customers. In 2008, in the bad old days of Etsy, engineers would quit in droves after every holiday season because they knew they couldn't survive another one. Even then, they knew that they had this big problem: in order for a feature to actually delight customers, two teams would have to work together. The devs would work in the front end. In their case, it was PHP. And the DBAs would have to work in the back end, in their case, in stored procedures inside Postgres.

What it meant is that these two teams had to coordinate and communicate. They had to prioritize together. They had to sequence and marshal and de-conflict, and worst of all, they had to deploy together. Even Etsy at this time knew this was a big problem, so they created something called Sprouter, which is short for stored procedure router. The goal was to create this middleware that would enable the devs and the DBAs to work independently and then meet in the middle inside of Sprouter. And as Ian Malpass, a senior engineer, said, this required a degree of synchronization and coordination that was rarely achieved, to the extent that almost every deployment became a mini outage. If you're trying to do tens of deployments per day, this is a catastrophe, right? So nothing would work.

And so what's amazing is that as a part of the great rebirth of engineering culture at Etsy, their goal was to kill Sprouter. What they wanted to do was eliminate the need for Sprouter altogether, to fully enable the developers working on the front end to make all the necessary changes there with no reliance on any changes on the back end. What they found was that in every part of the property where they eliminated Sprouter, suddenly, the outcomes got much better. Code deployment lead times went way down, and the quality of the deployments went way up.

In my mind, this is one of the most marvelous examples of Conway's law. Conway's law, to paraphrase, says that if you have five teams working on the compiler, you will get a five-pass compiler. Consider what happened when we went from two teams that had to synchronize, communicate, and work together to create value to three teams. Suddenly, code deployment lead times went way up, and the quality outcomes went way down. But if we went from three teams that needed to work together to one team where there was no need to communicate with anyone else, then suddenly, code deployment lead times went way down and the quality outcomes went way up.

It is not enough to move boxes around on an org chart. Instead, we must also have an architecture that enables teams to be able to work independently, to enable every team to be able to independently develop, test, and deploy value to customers. Consider how bad life was when there were only three teams involved.

But in most large, complex organizations, it's not three teams who need to communicate, coordinate, synchronize, marshal, sequence, and de-conflict. It could be scores of teams. How do I know that? It's because there's this diagram right here. You initiate a ticket to start a deployment on the left and you might have to transit scores of teams to get it into production, so customers actually say thank you. You might need environments. They might need to be configured properly. You need data. You need it to be anonymized. You need to have security reviews, change approval boards, middleware changes, firewall rule changes. It doesn't take a lot to go wrong before we are looking at code deployment lead time that is measured in weeks, months, or even quarters.

One of the biggest surprises for me in the State of DevOps research is to what extent architecture predicts performance. In 2017, we tested this in the State of DevOps research and we found that architecture is one of the top predictors of performance. In fact, it is a higher predictor of performance than even continuous delivery. To what extent can we make large-scale changes to our parts of the system without permission from anyone else outside of our team? To what extent can we do our work without a lot of fine-grained communication and coordination with people outside of our team? To what extent can we do our own deployments and releases on demand, independent of services that we depend upon?

I love this one. To what extent can we do our own testing on demand without the use of a scarce integrated test environment of which there are never enough, they're never cleaned up, which actually jeopardizes the actual test objectives? And if all those things are true, we should be able to do deployments during normal business hours with negligible downtime.

What this finding shows is that nothing impacts the work of developers more than architecture. In “The Phoenix Project,” if there were a metric that that book was about, it is the bus factor, as measured by how many people need to be hit by a bus before the project, service, or organization is in grave jeopardy. And in “The Phoenix Project,” the bus factor was one because it was Brent, right? So if Brent got hit by a bus or won the lottery and left, suddenly no outage could be fixed, no major piece of complex work would be done because all the knowledge was in Brent's head, right? And so we want not a low bus factor, we want a high bus factor. We don't want to be reliant on an individual, we want to be reliant on a team or better yet a team of teams.

In “The Unicorn Project,” the corresponding metric would probably be the lunch factor, as measured by how many people we need to take out to lunch to get something important done. Is it the Amazon ideal of the two-pizza team, where every team, no larger than can be fed by two pizzas, can independently develop, test, and deploy value to customers? Or do we need to feed everyone in the building?

Consider the case of that complex deployment I shared with you. We might have to feed 200 or 300 people for multiple days, right? So that's a very high lunch factor. Or consider the case when we need to innovate on behalf of the customer. If we need to implement the features, but we have dependencies on 43 different teams, including the data team, then suddenly we have to take 43 different people out to lunch. We have to explain what we're trying to do, why it's important, what we need from them, and if any one of them says no, then suddenly we can't get done what needs to be done.

In the ideal, anyone can implement what we need to by looking at one file, one module, one application, one namespace, one container, and make all the needed changes there and then be done. And not ideal is that to make our needed changes we have to understand and change all the files, all the modules, all the applications, all the containers because the functionality is smeared across that entire surface area. And so that is obviously not ideal.

Ideal is that we can not only implement changes in one place, but we can also test them in one place. In other words, changes can be independently implemented and tested, isolated from all the other components. And so that's the notion of composability. Not ideal is that for us to get any assurance that our changes will actually work as designed in production we have to test them in the presence of every other component, thus drawing us back into those scarce integrated test environments, just coupling us to the entire enterprise. So that is obviously a very high lunch factor. Low is better.

Ideal is that every team not only has the expertise and capability, but also has the authority to do what our customer wants. Not ideal is that to get anything done we have to escalate everything up two, over two, and then down two, or visually depicted, we have to go up two, over two, then down two.

I think one of the best examples of this, both in the good case and bad case, is the book “Team of Teams,” written by General Stanley McChrystal with Chris Fussell and Dave Silverman. This is the amazing story of the joint Special Forces task force that was battling a far smaller and nimbler adversary in Iraq in 2004. Their mission was to dismantle the Al-Qaeda in Iraq terrorist network. They found that they couldn't, despite being larger and having better technology. It was only after being able to push decision making down to the edges and allowing middle-level leaders to work across a vast enterprise that finally they were able to make this so. This is one of three books I'll recommend in this presentation.

I love the phrase data is the new oil, but I love this phrase even more. Data is the new soil. Make no mistake. Data is a software game.

I have seen so many instances where leaders leading $50 million software projects don't actually have a software technology background, so they're not even doing things like version control, right? And these are absolutely critical to not screw things up but more importantly to achieve the mission.

The first ideal is all about locality and simplicity. The second ideal is about focused flow and even joy. So much of this was informed by me learning a functional programming language called Clojure that either runs on the JVM or gets transpiled into JavaScript. This is one of the most challenging things I've ever had to learn professionally, but it was also the most rewarding. And so I did this in around 2016.

Just to set the context of how impactful this was to me, for decades I have self-identified, not as a developer, but as an ops person. This is despite getting my graduate degree in compiler design back in 1995. But it was my observation that it was ops where the saves were made, it was ops who saved us from terrible developers who didn't care about quality, but pushed to production anyway. It was ops who actually protected our data and applications because it certainly wasn't those ineffective security people.

What I found is that I changed my mind. After learning Clojure, I now decisively self-identify as a developer. And I think it's because it's so much fun and also demonstrates that you can build so many great things with so little effort these days because of all of the miracles that modern technology affords. And so the famous French philosopher, Claude Lévi-Strauss, would ask of certain tools: is it a good tool to think with? I think there are so many things from the functional programming community that are better tools to think with. Things like immutability and composability. Immutability is the property that once you create something, you can't change it. Composability is the ability to test and run things independently of other components.
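As a small illustration of those two ideas, here is a sketch in Python rather than Clojure. The `Order` type and its fields are hypothetical, chosen only for the example. A frozen dataclass gives immutability: an "update" produces a new value and leaves the original untouched. Composability shows up in that each function can be tested on its own, with no other component present.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Order:
    """Hypothetical immutable value; fields cannot be reassigned."""
    item: str
    quantity: int
    unit_price_cents: int


def with_quantity(order: Order, quantity: int) -> Order:
    """Return a new Order with a different quantity; the input is unchanged."""
    return replace(order, quantity=quantity)


def total_cents(order: Order) -> int:
    """Pure function: the result depends only on the argument."""
    return order.quantity * order.unit_price_cents


original = Order(item="widget", quantity=2, unit_price_cents=500)
updated = with_quantity(original, 3)

# The original value still exists exactly as it was created.
print(total_cents(original), total_cents(updated))
```

Attempting `original.quantity = 5` raises `dataclasses.FrozenInstanceError`, which is the language enforcing the "you can't change it" rule.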

These were pioneered in programming languages, but these are such good tools to think with that they're showing up in infrastructure and operations as well. So if you look at Docker, fundamentally it is all about immutability, which is a reason why once you create a container you can't really change it. If you want it to persist, you have to create a whole new container. Kubernetes applies that to not just a component level, but to the system level. Whenever you see something like Apache Kafka or CQRS or event streams, someone's trying to think about creating an immutable data model where you're not allowed to change the past.
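The idea of an immutable data model where the past cannot be changed can be sketched as an append-only event log. This is an illustrative toy, not how Kafka or any CQRS framework is implemented; the `Event` type and the account example are assumptions. Current state is never stored and overwritten. It is derived by replaying the events, and a mistake is corrected by appending a compensating event rather than editing history.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical immutable event in an account's history."""
    kind: str      # "deposit" or "withdraw"
    amount: int


Log = Tuple[Event, ...]


def append(log: Log, event: Event) -> Log:
    """Return a new log with the event added; the old log is untouched."""
    return log + (event,)


def balance(log: Log) -> int:
    """Derive the current state by replaying every event in order."""
    total = 0
    for event in log:
        if event.kind == "deposit":
            total += event.amount
        elif event.kind == "withdraw":
            total -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return total
```

Because every earlier version of the log is still a valid value, the balance as of any point in the past can be recomputed at any time.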

In fact, version control is all about immutability which is why we get yelled at if we change the commit history, right? Because we're really not supposed to change the past. Databases. Terrible things happen if we rewrite values that we can't get back, right? In the worst case we may have to restore data, which is why things like Delphix make it so much safer to work with data. In the ideal, when we are using these better tools to think with, our time and energy are focused on solving the business problem and we're having fun because work is safe and creates joy.

Not ideal is that all our time is spent trying to solve problems that we don't even want to solve, whether it's writing YAML files or trying to escape spaces inside of file names inside of Makefiles or writing endless Bash scripts. And I will tell you that one of the biggest surprises for me after learning Clojure is that there are all these things I used to enjoy doing a decade or two ago that I now detest doing.

Here are all the things I detest these days. I hate everything outside of my application. I hate connecting anything to anything else because it always takes me a week. I hate updating dependencies because everything breaks. I hate secrets management. I'm the person who often checks in secrets into the repo causing all sorts of grave problems. I hate test data management, Bash, and YAML patching. I can't figure out Kubernetes deployment files nor can I figure out why my cloud costs are so high. I don't mean to diminish any of these activities, especially when it comes to security as described by Jed earlier today, right? Arguably security is as important as the feature I'm trying to build, but it's just not as much fun for me anymore.

What I find so exciting about infrastructure and operations these days is that you can do almost everything self-service within the development platforms that we use in our daily work. Whether it's monitoring, deployment, environment creation, security scans, orchestration, data anonymization, all of these things we can get from platforms that we use in a way that is self-service and on-demand. In other words, we don't have to open up a ticket and pester someone for weeks to get us what we need. Instead, we can get it with immediacy and fast feedback. Those are the conditions that allow us to work with a sense of focus and flow or even joy.

And that is why I make the claim that there's never been a better time to be in infrastructure and operations, because all that knowledge is as important as it ever was, but we don't want it in people's heads. We want it in the platforms that developers use in their daily work. Flow has a lot of connotations in the Lean community, but there's a connotation that I love which comes from Dr. Mihaly Csikszentmihalyi.

He gave what I believe is the best TED Talk of all time, called "Flow: The Secret to Happiness." He also wrote this amazing book called "Flow: The Psychology of Optimal Experience." He describes flow as a condition when we are doing work that we truly enjoy, that challenges us, that rewards us, where we lose sense of time, or maybe we even lose sense of self, that transcendental experience that we have when we are truly doing work that we enjoy. And I think what is so great about platforms that enable developer productivity is that they allow developers to have a sense of flow that allows us to be hundreds or even thousands of times more productive than when we're not.

And so before we leave this section, I want to share with you one metric that I believe is the most strategic metric for any technology organization: what is our lead time for changes? That is the time from the point at which we introduce a change into version control, through integration, through testing, through deployment, so that customers say, thank you. And so that might beg the question, why do we start the lead time clock there? Why don't we start earlier, say when the feature goes into implementation and development, or when the idea is first conceived? In other words, the point of ideation. Those are all valid lead time measures, but the reason why the code commit lead time is so important is that it represents the dividing line between two very different parts of the value stream.

So everything to the left of changes being put into version control is design and development. And so those are highly experimental processes. Often we are doing work for the first time, maybe never again to be repeated. Estimates are highly uncertain and variable. And that's just a fact of life. These things don't take minutes or hours. Large design and development projects may take weeks, months, or even quarters, but that is very different from what happens after we put changes into version control.

This is product delivery or better yet, product build, test, and deploy. And here we want the exact opposite characteristics. We want builds, tests, and deploys to happen quickly, reliably, at the same time, every time, ideally entirely automated. And we don't want it to take weeks, months, or quarters. We want these things to be happening in minutes or hours. Oh, and better yet, we want these things to be happening all the time, after every code commit. And so code deployment lead time simultaneously predicts the effectiveness of build, test, and deploy, but it also predicts how quickly we are giving developers feedback on their work.
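A minimal sketch of measuring code commit lead time as described here: the clock starts when a change enters version control and stops when that change is running in production. The function names and the idea of pairing each commit timestamp with its deploy timestamp are assumptions for illustration; in practice the timestamps would come from the version control system and the deployment pipeline.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Tuple


def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time from commit to version control until running in production."""
    if deployed_at < committed_at:
        raise ValueError("a change cannot be deployed before it is committed")
    return deployed_at - committed_at


def median_lead_time(changes: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Median lead time over (committed_at, deployed_at) pairs.

    The median is used so that one slow change does not hide the typical case.
    """
    return median(lead_time(c, d) for c, d in changes)
```

A median measured in minutes or hours suggests build, test, and deploy are giving fast feedback. A median measured in weeks or months suggests the link between cause and effect is already being lost.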

Consider the scenario. If I make a mistake and I check it into version control, and the first opportunity for me to detect that error is nine months later during integration testing, then the link between cause and effect has basically vanished. So, that means the time to find the error and the time to fix the error is going to be orders of magnitude longer. In the ideal, what should happen? The instant I check into version control, automated testing kicks in, and I should find that error within minutes, worst case, hours. And so this will bring down the time to find and fix the error, but it also will prevent me from making the same mistake for the next nine months.

This game is not about just learning from our mistakes. It's also about learning from customers. One of the best examples of this comes from Scott Cook, the founder of Intuit. He said that by creating this rampant innovation culture, in this case for the TurboTax property, they were able to perform 165 experiments during the peak three months of the tax filing season. And so if you're like me, you might be thinking these people are maniacs. Many retailers don't make any changes during the holiday season, because they are afraid of an outage.

Why would these people make changes when it matters the most? And the answers are revealed in the next paragraph. He says the business result was that they were able to increase the conversion of their website by 50%. The employee result was that everyone loved it, because they could see their ideas making it to market and moving the needle in terms of satisfying the customer's needs. And so this game is all about outlearning the competition, and you can't do that without making deployments safe.

And speaking of safety, it turns out that all the measures I've talked about today, you can predict with one simple question. On a scale of one to seven, to what degree do we fear doing deployments? One is we have no fear at all. We just did one. Seven is we have an existential fear of doing deployments. And that is why in our ideal, the next deployment we would do is never. We never want to make another deployment again. You want deployments to be able to be made fearlessly and safely.

In the ideal, we should be able to implement and test whether our feature works on our laptop and learn whether it will work in production within seconds. Not ideal is that the only way we can get any assurance that our feature will work as designed in production is by waiting minutes, hours, days, weeks, or even quarters. Ideal is that we can get the data we need in minutes or hours.

Not ideal is that we're being held hostage to some sort of shared service team that makes us wait days, weeks, months, or even worse. The first ideal is all about locality and simplicity. The second ideal is about focus, flow and joy. The third ideal is about improvement of daily work. So, this shows up in “The Phoenix Project” as well.

The notion is that improvement of daily work is even more important than daily work itself. So, not ideal is TWWADI, in other words, the way we've always done it. That's not good. Ideal is MTBTT. In other words, let's make tomorrow better than today.

This is, of course, Google SRE principle number two. And I think this too is almost poetic. Who would not want to make tomorrow better than today? In manufacturing, a great example of not ideal was the General Motors Fremont manufacturing plant in Northern California. For decades this was notorious, because it was the worst-performing automotive plant, not just in North America, but around the globe. There were no effective procedures in place to detect problems during the assembly process, nor were there explicit procedures on what to do when problems were found.

As a result, there were instances of engines being put in backwards, cars missing steering wheels or tires, cars even having to be towed off the assembly line, because they wouldn't start. So, obviously that's not ideal. What is ideal? It is a system where we are putting as much feedback into the system as possible, sooner, safer, faster, and cheaper with as much clarity between cause and effect. And the reason why is that we want to be able to invalidate assumptions, because the more of that we do, the more we learn and the more we are able to outlearn the competition.

In 2011, I had the opportunity to spend a week at the University of Michigan getting trained in the Toyota production process. I was amazed to find during the field portions of the training that plants modeled after the Toyota Production System had this cord on top of every work center, where everyone is trained to pull it when something goes wrong. If I created a defective part, I would pull the cord. If I got a defective part from someone else, I would pull the cord. If I had no parts to work on, I would pull the cord. I would even pull the cord if the work took longer than documented. If it took a minute 20 seconds, when it was supposed to take 55 seconds, I would pull the cord. If the problem couldn't be resolved in 55 seconds, the entire assembly line would stop.

Just in your head, think about how many Andon Cord pulls there are per day. The answer is this: 3,500 times per day. And so if you're like me, you think these people are maniacs. These people have no idea what they're doing. And I think the reason why I reacted that way is that the way I was trained, especially as a first-line leader, was that I had to solve problems in my area of responsibility before they caused a global disturbance. And it seems like the Andon Cord is doing the exact opposite. They are pulling the Andon Cord 3,500 times a day, potentially turning every local disturbance into a global disturbance. And so, why would they do that?

I think there are two answers that are given when you ask people on the plant floor. One is that if you don't pull the Andon Cord then and there, technical debt will accrue downstream where it will become much more difficult and much more expensive, or maybe even impossible to fix. So, that's number one. The second answer that you'll hear, I think is even more profound, which is that if you don't pull the Andon Cord and put in a systemic fix then and there, you will have the same problem 55 seconds later. And so that is the notion of a daily workaround. And so daily workarounds exist in our world as well. But because our work takes longer than 55 seconds, it is not as visible, but trust me, it is just as destructive. This gets to the point that greatness is never free.

Greatness is a decision. It is an investment. In our work, it comes from paying down technical debt as you go. About 15 years ago, someone wise once told me, “When dealing with the executives, stick with small numbers and primary colors,” which I thought was very funny. But what I've learned in my journey is that for complex ideas like technical debt, that is too complex. We must stick with something simpler. We must stick with up and down. So I'm going to tell you where technical debt comes from using only up and down arrows. Okay, here we go. Imagine the time when you had to get to market quickly. Sometimes it was to be first to market, which is a great thing. Sometimes, you just want to get in the market. You'd be happy to be last to market. You have to get in the game.

When you're in that mode, all the focus is on the feature. This is where we are glad to accept and build up a little technical debt and risk, which drives down quality, which drives up the number of defects that affect customers. The story doesn't end there. When you fast forward in time, invariably what you find is that the feature rate goes down and the amount of time working on defects goes up to the point where we can even cross 100%.

These are exactly the conditions where defects dominate daily work. Site reliability tanks. We go slower and slower. Customers leave. Morale plunges and engineers leave because everything is now just so hard. Everything that was once easy now almost seems impossible. A friend of mine on Twitter, John Cutler from the product community, said, "Yes, exactly. In 2015, a certain reference feature would take 15 to 30 days. Three years later, the same class of feature now takes 10 times longer."

If you have felt it, you're not crazy. This is because of technical debt. And this happens even when you add more developers. And so make no mistake, technical debt kills companies. One of my favorite examples of this is Nokia. The second book I'll recommend is this amazing book called “Transforming Nokia” by Risto Siilasmaa. You may be thinking, what can I possibly learn from someone who oversaw the decimation of 93% of the market cap of Nokia in their battle against Apple? And I would say, a lot. Risto Siilasmaa was the founder of F-Secure. He got invited to join the board of Nokia in 2008, and he thought this would be the culmination and the height of his career.

He writes this unflinching assessment about how he was unable to deal with a very domineering chairman. It is a great book for so many reasons. My favorite line in the book is about when he learned in 2010 that the build times for the Symbian OS took 48 hours. He said it felt like being hit in the head with a sledgehammer, because he knew that if it took two days for anyone to determine whether a change worked or would have to be redone, then Symbian OS, on which all their hopes, dreams, and aspirations rested, was a mirage. That's what actually drove them to Windows Mobile, which didn't treat them so well either. But that was actually a far better bet than staying on Symbian OS.

Nokia didn't make it. Every tech giant found themselves in the same situation, but they made a different decision. Whether it's eBay, Microsoft, Google, Amazon, Twitter, or LinkedIn, they all almost died from technical debt. They all realized that they had to reduce technical debt to survive as a company. And one of my favorite examples of this is the 2002 Microsoft security stand down. Some of you may remember, this is when almost every Microsoft product was being mowed down by security vulnerabilities, things like SQL Slammer, Code Red, and Nimda.

This eventually prompted Bill Gates to have a feature freeze across almost every product at Microsoft. He wrote this incredible memo to every employee and to the world, and it is now known as the Trustworthy Computing memo. One of my favorite lines there was that if a developer ever has to choose between adding a feature or addressing a security issue, fix the security issue, because the survival of the company depends upon it. And so make no mistake, this was incredible. Almost every Microsoft product stopped working on features for almost a year.

So, what happened? Every company that made it reduced feature work to zero so they could pay down technical debt, increase quality, and drive down the defects, maybe not to zero, but to something that could be sustained over time. They also invested in architecture that could elevate developer productivity, thus enabling them to ship more features than ever, often increasing by orders of magnitude. It is not just CEOs of companies who say this. It is also product managers and product owners.

This is Marty Cagan. He wrote the famous book, “Inspired: How to Create Products that Customers Love.” He has trained generations of product owners on how to build great products. And one of the pieces of advice that he's given them is you must take 20% of all engineering cycles off the table. They're not there for you to spend. They're for engineers to use however they best see fit to fix problematic areas of code, to refactor the environment, to re-architect, to automate, to do whatever it takes to keep working on features.

Where did he learn this lesson? He learned this lesson at eBay, where he served as the vice president of product management for two years. And this is during the early 2000s. And during that time he didn't ship one major new feature because every engineer was just trying to keep the site up. And so, what he realized is that, if you don't pay your 20% tax, you will inevitably pay 100% tax because you will not be able to ship a feature. The only way to avoid this fate is to pay technical debt down as you go.

In the ideal, we have 3 to 5% of our best engineers working on improving developer productivity. Google famously has over 1,500 developers just working on dev productivity. That's a billion dollars plus of annual spend. Microsoft likely has multiples more than that working on dev productivity. If that's ideal, what's not ideal? The only people working on dev productivity are the summer interns, and people not good enough to be developers.

In a really neat story of karmic continuance, Satya Nadella in a town hall meeting a year and a half ago said, "If a developer ever has to choose between working on a feature or working on their own productivity, choose your own productivity, because this is how we make compounding interest work in our favor." Right? Whereas technical debt is very much working against us. 

The first ideal is all about locality and simplicity. The second ideal is about focus, flow, and joy. The third ideal is about improving daily work. And the fourth ideal is psychological safety. When researching “The Unicorn Project,” it was so rewarding to revisit the work of Project Aristotle at Google. This is the incredible study that spanned over 60,000 employees across six years across 200 plus teams. This was all in service of their quest to understand what made their greatest teams great.

What they found in the research was that it was one thing, psychological safety, that made the biggest impact, as measured by to what degree do members on a team feel safe to say what they really think, to take risks, to innovate without fear of being ridiculed, embarrassed, or even punished. This was a higher factor of what made teams great than dependability, structure and clarity, meaning of work, or even impact of work.

Since 2014, my area of passion has been studying not so much the tech giants, the Facebooks, the Amazons, the Netflixes, Googles, and Microsofts, but instead large complex organizations that have been around for decades or even centuries. And what I find so amazing is that these organizations across every industry vertical are using those DevOps principles and patterns to win in the marketplace.

I've been holding this conference since 2014 called the DevOps Enterprise Summit. We've just held our 13th event. And I am so delighted that we've had hundreds of case studies from hundreds of leaders showing how DevOps is being used across every industry vertical. Over the years, we've had over 900 talks from technology leaders across almost every industry vertical, from some of the most well-known brands, showing how they are using DevOps principles and patterns to win in the marketplace. One of my favorite case studies comes from this person, Heather Mickman.

She presented at the first DevOps Enterprise Summit in 2014 and has presented over the years about her journey at Target. She was a senior director of [inaudible 00:39:14] at Target. And the business problem that she set out to solve was this: every time a development team wanted access to a system of record, they would often have to wait six to nine months to get it. And the reason is that everything was integrated point to point. And so, it took six to nine months for the integration teams to set up that specific integration.

The way she set out to solve that problem was creating the Next Generation API Enablement Project. This enabled hundreds of initiatives, everything from ship to store, so one of the most strategic capabilities if you are trying to compete against Amazon, all of the in-store apps, the Starbucks integration, the Pinterest integration. All of that was enabled by this project. It was an architectural change that made data available to people who needed it.

As a testament to her work, her team doubled in size year over year for four years in a row, just showing how strategic a capability they viewed this to be. In 2015, I actually got to follow her around for a couple of days. And so, I saw many, many amazing things. But one of the most amazing things was the certificate that hung on her desk. It says, "Lifetime achievement award presented to Heather O'Sullivan Mickman for annihilating TEP and LARB." So, of course I had to ask, what is TEP and LARB? She said, TEP is the technology evaluation process. LARB stands for the lead architectural review board.

Whenever you want to do something new, you would have to fill out the TEP form, which eventually enabled you to pitch to LARB. You would walk into this room and you see all the development and enterprise architects on one table, all the ops and security architects at the other table. They pepper you with questions. They start arguing with each other. And then, they assign you 30 more questions and invite you to pitch them again in 30 days. To which she responded, "None of my teams should ever have to go through this. In fact, none of the thousands of engineers at Target should ever have to go through this." And she asked, "Why is this process here?" She said no one could really remember.

There was some memory of some sort of unspeakable disaster that happened 30 years ago. But what exactly that disaster was has been lost in the mists of time, and yet the process still persists. Her endless and relentless lobbying led to the dismantling of TEP and LARB, earning her the certificate that hung proudly on her desk for years.

This is something that we saw in every technology transformation at DevOps Enterprise. And we actually test this in the State of DevOps research. We asked every respondent 15 questions along five axes. We asked about vision: to what extent does your leadership understand the grandest goals of the organization? And to what extent can they get in front of them, not just to be relevant, but to help with the achievement of the most important goals? Intellectual stimulation: to what extent can the leader challenge basic assumptions of how we do work? In other words, just because it made sense maybe 30 years ago doesn't necessarily mean it makes sense now. Inspirational communication, supportive leadership, personal recognition: we found all of these impacted performance.

In fact, we found that, for the respondents with the bottom 30% of these reported characteristics, they were only one half as likely to become high performers. So, in the DevOps community, we love talking about blameless postmortems. We love talking about chaos monkeys, where we deliberately inject faults into a production environment to see if we are as resilient as we think we are. But none of that is possible without psychological safety, which now gets me to the last ideal of customer focus. And this is one of the most profound professional aha moments of my career.

It is January 2018. I'm in Detroit, Michigan to visit Chris O'Malley. He is the CEO of Compuware, the famously resurgent mainframe vendor. And I am so much looking forward to this day because I've learned so much from him over the years, so much so that I invited my buddy, Dr. Mik Kersten, author of the book “Project to Product.” And so, we're walking to the campus and I look down at the agenda that they had prepared and I immediately feel embarrassed because the first thing on the agenda is a data center tour. So, I turn to Mik and I tell him, "I'm so sorry that you flew out all this way. I don't know what we're going to learn by seeing someone's Halon extinguishers, which I've already seen plenty of in my career."

However, what we saw in the data center blew our minds, because here's what we saw. It was about 45,000 square feet of empty data center space. So, you look down on the ground and you see these green outlines, like in a murder scene, of where the server racks used to be. In the middle of each outline is a tombstone describing what business process used to run there and how much money they saved by either getting rid of it or moving it to a SaaS vendor. And so, you'll see things like EMEA financials, on-prem email, financial ERP. You see this other sign that says, "Over 17 tons of obsolete equipment either removed or recycled, sent to a better place." The reason why this is so amazing to me is that it is one of the best examples of moving from context to core.

This concept comes from the book “Zone to Win” by Dr. Geoffrey Moore. Core, he defines as the core competencies of the organization that create lasting durable business advantage that customers are willing to pay us for. Context is everything else. Context could be mission critical. It often is, like payroll systems. Right? We have an obligation to pay our employees on time and accurately, but it is not something that customers value. They are not willing to pay us extra money just so that we can have world-class payroll services. That last picture showed eight million dollars of context being reallocated to core. Eight million dollars moved from things that customers don't care about into R&D, which customers absolutely do.

Not ideal is that functional silo managers prioritize their silo goals over the grandest business goals. Ideal is that functional silo managers, or for that matter anybody, can look at the work we're doing at any given point in time and ask, is this creating lasting, durable business advantage that customers are actually willing to pay us for? And, if not, we should unflinchingly ask, should we be doing this work at all, or give it to a vendor for whom it is a core competency, and whom we are happily willing to pay for it? Why do I think this is important?

The world is changing very fast. It is not the big beating the small anymore. Instead, it is the fast beating the slow.

I started this presentation with a claim: I believe that as much value as the tech giants have created, that will be dwarfed by the value created across every organization in every industry vertical, which together employ over 17 million developers.

If we can get them as productive as if they were at Facebook, Amazon, Netflix, Google, and Microsoft, that will without doubt add trillions of dollars to our economy every year. When that happens, suddenly the impossible becomes possible.
