In a recent interview, John Allspaw (@allspaw), vice president of technical operations at Etsy and a speaker at Velocity 2011, talked about how resilience engineering can be applied to web environments. Allspaw said the typical “name/blame/shame” postmortem meeting isn’t an effective approach. “Second stories” are where the real vulnerabilities and solutions will be found.
Allspaw elaborates in the following Q&A.
What is resilience engineering?
John Allspaw: In the past 20 years, experts in the safety and human factors fields have been crystallizing some of the patterns that they saw when investigating disasters and failures in the “high risk” industries: aviation, space travel, chemical manufacturing, healthcare, etc. These patterns formed the basis for resilience engineering. They all surround the concept that a resilient system is one that can adjust its functioning prior to, during, and after an unexpected or undesired event.
There is a lot that web development and operations can learn from this field because the concepts map easily to the requirements for running successful systems online. One of the pieces of resilience engineering that I find fascinating is in the practical realization that the “system” in that context isn’t just the software and machines that have been built to do work, but also the humans who build, operate, and maintain these infrastructures. This means not only looking at faults — or the potential for faults — at the component level, but at the human and process level as well.
This approach is supported by a rich history of complex systems enduring unexpected changes only because of operators’ adaptive capacities. I don’t think I’ve felt so inspired by another field of engineering. As the web engineering discipline matures, we should be paying attention to research that comes from elsewhere, not just in our own little world. Resilience engineering is an excellent example of that.
How is resilience engineering shaped by human factors science?
John Allspaw: To be clear, I consider myself a student of these research topics, not an expert by any stretch. Human factors is a wide and multidisciplinary field that includes how humans relate to their environments in general. This includes how we react and work within socio-technical systems as it relates to safety and other concerns at the man-machine boundary. That’s where cognitive and resilience engineering overlap.
Resilience engineering as its own distinct field is also quite young, but it has roots in human factors, safety and reliability engineering.
Does web engineering put too much focus on tools?
John Allspaw: While resilience engineering might feel like scenario and contingency planning with some academic rigor behind it, I think it’s more than that.
By looking at not only what went wrong with complex systems, but also what went right, the pattern emerges that no matter how much intelligence and automation we build into the guts of an application or network, it’s still the designers and operators who are at the core of what makes a system resilient. In addition, I think real resilience is built when the designers and operators themselves are considered a critical part of the system.
In web engineering the focus is too often on the machines and tooling. The idea that automation and redundancy alone will make a system resilient is a naive one. Really successful web organizations understand this, as do the high-risk industries I mentioned before.
What are the most common human errors in web operations? What can companies do to avoid them?
John Allspaw: Categorizing human errors is in and of itself an entire topic, and may or may not be useful in learning how to prevent future failures. Lapses, slips, and violations are known as common types, each with their own subtypes and “solutions,” but these are very context-specific. Operators can make errors related to the complexity of the specific task, their training or experience, etc. I think James Reason’s “Human Error” could be considered the canonical source on human error types and forms.
There’s an idea among many resilience engineering researchers that attributing root cause to “human error” is ineffective in preventing future failures, for a number of different reasons. One is that for complex systems, failures almost never have a single “root” cause, but instead have multiple contributing factors. Some of those might be latent failure scenarios that pre-existed but were never exercised until they combined with some other context.
Another reason that the label is ineffective is that it doesn’t result in a specific remediation. You can’t simply end a postmortem meeting or root cause analysis with “well, let’s just chalk it up to human error.” The remediation items can’t simply be “follow the procedure next time” or “be more vigilant.” On the contrary, that’s where you should begin the investigation. Actionable remediation items and real learning about failures emerge only after digging deeper into the intentions and motivations for performing actions that contributed to the unexpected outage. Human error research calls this approach looking at “second stories.”
Documenting what operators should have done doesn’t explain why it made sense for them to do what they did. If you get at those second stories, you’ll be learning a lot more about how failures occur to begin with.
How can companies put failure to good use?
John Allspaw: Erik Hollnagel has said that the four cornerstones of resilience engineering are: anticipation, monitoring, response, and learning.
Learning in that context includes having a postmortem or post-issue meeting or process devoid of a “name/blame/shame” approach. Instead, it should be one that searches for those second stories and improves a company’s anticipation for failure by finding the underlying systemic vulnerabilities.
Responding to failure as it’s happening is interesting to me as well, and patterns can be found in the research across industries. Troubleshooting complex systems is a difficult business. Getting a team to calmly and purposefully carry out troubleshooting under time, money, and cultural pressures is an even tougher problem, but successful companies are good at it. My field can always be better at building organizational resilience in the face of escalating situations.
Postmortems are learning opportunities, and when they’re done well, they feed back into how organizations can bolster their abilities to anticipate, look for, and respond to unexpected events. They also provide rich details on how a team’s response to an outage can improve, which can point to all sorts of non-technical adjustments as well.
Should “resilience” be a management objective?
John Allspaw: That’s the obvious conclusion drawn from the high-risk industries, and I think it’s intuitive to think that way in online businesses. “Faster, cheaper, better” needs to be augmented with “more resilient” in order to get a full view of how an organization progresses with handling unexpected scenarios. We see successful companies taking this to heart. On the technical front, you see approaches like continuous deployment and gameday exercises, and on the business side we’re starting to see postmortems on business decisions and direction that are guiding design and product road maps.
This interview was edited and condensed.