Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.
In this week’s Radar Podcast episode, I chat with Dave Zwieback, head of engineering at Next Big Sound and CTO of Lotus Outreach. Zwieback is the author of a new book, Beyond Blame: Learning from Failure and Success, that outlines an approach to make postmortems not only blameless, but to turn them into a productive learning process. We talk about his book, the framework for conducting a “learning review,” and how humans can keep pace with the growing complexity of the systems we’re building.
When you add scale to anything, it becomes sort of its own problem. Meaning, let’s say you have a single computer, right? The mean time to failure of the hard drive or the computer is actually fairly lengthy. When you have 10,000 of them or 10 million of them, you’re having tens if not hundreds of failures every single day. That certainly changes how you go about designing systems. Again, whenever I say systems, I also mean organizations. To me, they’re not really separate.
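The arithmetic behind that point can be sketched in a few lines. This is a rough illustration, not anything from the interview: it assumes failures are independent and occur at a constant rate, and the 50-year mean time to failure is a hypothetical figure chosen only to make the numbers easy to follow. The function name `expected_failures_per_day` is likewise made up for this example.

```python
def expected_failures_per_day(fleet_size: int, mttf_years: float) -> float:
    """Expected failures per day for a fleet of identical machines.

    Assumes independent failures at a constant rate, so the fleet-wide
    rate is simply fleet_size divided by the per-machine MTTF.
    """
    mttf_days = mttf_years * 365
    return fleet_size / mttf_days


# Hypothetical MTTF, picked for illustration only (not a measured value).
ASSUMED_MTTF_YEARS = 50.0

for fleet_size in (1, 10_000, 10_000_000):
    rate = expected_failures_per_day(fleet_size, ASSUMED_MTTF_YEARS)
    print(f"{fleet_size:>10,} machines: {rate:.4f} expected failures/day")
```

Under these assumptions a single machine almost never fails on a given day, while a fleet of millions sees failures as a daily routine. The exact counts depend entirely on the MTTF you plug in.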
I spent a bunch of my time in fairly large-scale organizations, and I’ve witnessed and been part of a significant number of outages or issues. I’ve seen how dysfunctional organizations dealing with failure can be. By the way, when we mention failure, it’s important for us not to forget about success. All the things that we find in the default ways that people and organizations deal with failure, we find in the default ways that they deal with success. It’s just a mirror image of each other.
We can learn from both failures and success. If we’re only learning from failures, which is what the current practice of postmortem is focused on, then we’re missing … the other 99% of the time when they’re not failing. The practice of learning reviews allows for learning from both failures and successes.
In the practice of learning reviews—and of course, this is also present in the “blameless postmortems”—we don’t focus on a single root cause, but we focus on a bunch of conditions. That comes not out of anything other than the realization of the complexity of the systems that we work with.
In the current practice of postmortems, we talk about accountability, but really what that version of accountability means is, whose throat are we going to choke? Who are we going to punish? … In a learning review, we go beyond blame to achieve real accountability. … If there’s blame, or there’s punishment, then you’re not going to get the full account. You really cannot fully hold people accountable.
The other sort of lineage of removing blame and punishment actually comes from a normal non-restorative or punitive justice system, where in certain situations, we give people immunity. Why? So that they can give us the full account of what happened. We do that sometimes with people we know have done bad things. In mafia cases. In those cases, what we are saying or doing by giving people immunity is that we value the information they provide to us more than punishing them.
Why do we want to go beyond blame, why do we want to go beyond bias? Those two are the short circuits to learning. … More importantly, it’s not a one-time thing. We continually have to be learning about our systems and feeding that knowledge back into that system to make it more resilient.
Image by Raj srikanth800 on Wikimedia Commons.