O'Reilly Radar Podcast: Learning from both failure and success to make our systems more resilient.
In this week’s Radar Podcast episode, I chat with Dave Zwieback, head of engineering at Next Big Sound and CTO of Lotus Outreach. Zwieback is the author of a new book, Beyond Blame: Learning from Failure and Success, that outlines an approach to not only make postmortems blameless, but also turn them into a productive learning process. We talk about his book, the framework for conducting a “learning review,” and how humans can keep pace with the growing complexity of the systems we’re building.
When you add scale to anything, it becomes sort of its own problem. Meaning, let’s say you have a single computer, right? The mean time to failure of the hard drive or the computer is actually fairly lengthy. When you have 10,000 of them or 10 million of them, you’re having tens if not hundreds of failures every single day. That certainly changes how you go about designing systems. Again, whenever I say systems, I also mean organizations. To me, they’re not really separate.
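The scale argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the 2% annual failure rate are assumptions chosen for the example, not figures from the podcast, and it treats failures as independent and spread evenly across the year.

```python
def expected_failures_per_day(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected failures per day for a fleet, assuming independent failures
    spread evenly across a 365-day year."""
    return fleet_size * annual_failure_rate / 365


# ASSUMPTION: a 2% annual failure rate per machine, picked purely for illustration.
ASSUMED_ANNUAL_FAILURE_RATE = 0.02

for fleet_size in (1, 10_000, 10_000_000):
    per_day = expected_failures_per_day(fleet_size, ASSUMED_ANNUAL_FAILURE_RATE)
    print(f"{fleet_size:>10,} machines: {per_day:,.4f} expected failures/day")
```

Under that assumed rate, a single machine fails on average about once in 50 years, a fleet of 10,000 sees a failure roughly every other day, and a fleet of 10 million sees hundreds per day. Counting every component that can fail, rather than one failure source per machine, pushes those numbers higher.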
I spent a bunch of my time in fairly large-scale organizations, and I’ve witnessed and been part of a significant number of outages or issues. I’ve seen how dysfunctional organizations dealing with failure can be. By the way, when we mention failure, it’s important for us not to forget about success. All the things that we find in the default ways that people and organizations deal with failure, we find in the default ways that they deal with success. It’s just a mirror image of each other.
We can learn from both failures and success. If we’re only learning from failures, which is what the current practice of postmortem is focused on, then we’re missing … the other 99% of the time when they’re not failing. The practice of learning reviews allows for learning from both failures and successes.
An argument against the Five Whys and an alternative approach you can apply.
Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it. This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.
The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the new view on “human error” are becoming more widely known and that people are looking to explore their own narratives through those lenses.
I think that this is good, because my intent has always been (might always be) to help translate concepts from one domain to another. In order to do this effectively, we need to know also what to discard (or at least inspect critically) from those other domains.
The Five Whys is one such approach that I think we should discard.
In this episode, John Allspaw talks in-depth about blameless postmortems and creating a just culture.
When you’re dealing with complex systems, failure is going to happen; it’s a given. What we do after that failure, however, strongly influences whether or not that failure will happen again. The traditional response to failure is to seek out the person responsible and punish them accordingly — should they be fired? Retrained? Moved to a different position where they can’t cause such havoc again?
John Allspaw, SVP of technical operations at Etsy and co-chair of the O’Reilly Velocity Conference, argues that this “human error” approach is the equivalent of cutting off your nose to spite your face. He explains in a blog post that at Etsy, their approach is to “view mistakes, errors, slips, lapses, etc., with a perspective of learning.” To that end, Etsy practices “blameless postmortems” that focus more on the narrative of how something happened than on who was behind it, and that remove punishment as an outcome of an investigation.