The zero-error fallacy: What really counts for DevOps teams
"We went from 26 coding errors per month to between 0 and 2. Isn't that amazing?" the CEO said to me, beaming with pride. I nodded, silently wondering how to respond without spoiling my chances of ever reaching the man again.
"How did you do it?" I finally asked. I was hedging, stalling, and genuinely fascinated by how a very smart CEO for a company that managed a major real estate website had come to believe that a great result had come about—when in fact he'd had the wool pulled over his eyes.
I get that coding errors can be problematic for a site such as his and that they can quickly translate to big-dollar losses. Many people relied on the company's website for the latest and most accurate information, and commissions depended on his people doing what they were supposed to do. So how had he gone for months with zero coding errors?
"We installed a big, big electronic sign in the office," he said, still beaming. "You know, with the big red numbers on it. With that, we could show everyone how many coding errors were made that month, and how many days they still had to go during that month."
I knew immediately where he'd gone wrong, and I'm hoping you won't fall into the same trap.
The zero-error fallacy
The CEO was quite excited about the zero-error result. "It was really cool, almost like a competition. You know, coders competing with themselves, trying to beat the number from last month. It was a fantastic way to hold them accountable, and for them to hold each other accountable."
"I have no trouble demanding excellence and accountability from my people. The first month they scored zero, I ordered pizza for the whole team. They loved it! And they loved me." He leaned back and folded his arms across his chest. He was the perfect picture of self-satisfaction.
"How did you find out about the errors that were made?" I asked. "Oh, through self-reporting, and people reporting each other when they found something, either directly in the code, or through our testing, or because something wasn't working properly—you know," he said.
I have worked in, and with, high-risk industries. Many of them had similar illusions about the absence of danger and non-existence of problems just because some countable aspect of their operation had become lower.
And I haven't been just a bystander to this phenomenon. I used to fly big jets, with 189 people on board who expected to arrive alive at the other end.
What I learned from that experience was that error counts don’t matter. The greatest illusion of all is that the difference between excellent and crappy operations is the number of errors or failures or mishaps or violations, or some other negative, countable property. It is an illusion, a myth that is demonstrably false.
Take Deepwater Horizon. The floating drilling rig had many years of supposedly injury-free and incident-free performance. Then disaster struck: 11 people were killed in what became the largest marine oil spill in history. The zero count on injuries and incidents had predicted nothing.
But it did have an effect. The record of zero injuries leading up to the event created an illusion of safety when the opposite was true. Researchers at MIT have shown that the more incidents an airline has, the lower the passenger mortality risk, and construction sites with relatively more incidents in a given year have fewer worker deaths than those with no reported incidents.
So the fact that a real estate website had no coding errors reported over the past month has no bearing on whether the site is likely to have a big meltdown in the future.
Stop counting: Here’s what matters
So what does make the difference when it comes to ensuring quality and reducing the probability of failure? I once worked with a healthcare system that had stabilized its performance to one adverse event per 13 care encounters. In other words, for every 13 patients who walked in the doors, one would walk out in worse shape. Some died. We called this "iatrogenic harm," or harm caused in the process of providing care and cure.
This healthcare system did a lot of work investigating the one care encounter that went wrong. And of course, the investigations turned up the usual suspects: errors, violations, codes of practice and guidelines not followed, communication failures, calculation mistakes—those sorts of things.
Then we asked: "Do you know why the other 12 care encounters go right?" They did not. They presumed it was due to the complete absence of errors or violations, because codes of practice and guidelines had been followed, and because there were no communication failures or calculation mistakes.
That's a nice, soothing thought: Have your people pursue excellence, make no mistakes, be accountable, and you'll get perfect results. Declare war on error! Pursue perfection! It's all been told (and sold, to unsuspecting CEOs who like the idea of supposedly accountable people and zero bad things).
And yet it didn't make any difference. After weeks of study, we established that in the 12 that went right, there were pretty much as many errors, violations, codes of practice and guidelines not followed, communication failures, and calculation mistakes as in the one that went wrong.
Those things didn't make a difference. They didn't discriminate between failure and success. What made the difference was not the absence of countable negatives.
Don’t hide bad news
What made the difference was the presence of positive capacities—in people, in teams, in the organization. Some of these capacities included:
- Not taking past success as a guarantee of future safety.
- Being open to dissenting opinions and deliberately building diversity in teams.
- Keeping a discussion about risk alive even when everything looked safe.
- Being open to hearing about, and sharing, mistakes and miscues.
- Possessing the capacity to say "no" in the face of acute production pressures.
- Having leaders with a well-calibrated sense of what it takes for their people to create success (even if their people routinely make it look effortless and smooth), despite the inevitable goal conflicts, resource constraints, and performance pressures.
Incentivizing the hiding of bad news is just about the stupidest thing a manager or CEO can do.
The solution: Develop a safety culture
A safety culture allows the boss to hear bad news. It is one in which the boss actually invites bad news, and may even reward it. Hunting for errors and failures and trying to squash them, in the false belief that this holds people accountable and drives them to pursue perfection and excellence, won't lead to anything more than a silent, sterile, superficially placid organization where the real failure is brewing just below the surface, safely out of sight of the people in charge.
Until it isn't.
I told the CEO of the company that ran the real estate website this, and showed him the data from other fields more risky than his own. He nodded, and then asked most of his managers to exit the room, leaving only him, me, and a few trusted colleagues.
In the conversation that ensued, we talked about hunting for success rather than failure, how people actually get stuff done, what in the operation really would be predictive of failure and success, and how to get people to talk about their mistakes and learn from them without negative consequences for them, their reputation, or career. Now, that would be real accountability, I told him. We also talked about how not to focus on the silly, countable, obvious stuff.
As it turned out, his trusted managers clearly had seen these things play out, but had not had the courage or opportunity to tell him. Even from them, he wasn't used to hearing bad news.
I don't know whether the electronic coding-error sign was ever consigned to the scrap heap. I hope so. If it wasn't, I can only hope some coder hacked it. I wonder what they'd make it show.
Want to know more about the human factor in DevOps failure and success? Drop in on my presentation at DevOps Enterprise Summit, where I'll talk more about counting what really counts.