Building for failure is a recipe for success

How you handle failure can mean the difference between "just another incident" and a revenue-stealing accident.

I was ready to get home. I’d been dozing throughout the flight from JFK to SFO, listening to the background chatter of Channel 9 as a lullaby. Somewhere over Sacramento, the rhythmic flow of controller-issued clearances and pilot confirmations was broken up by a call from our plane:

“NorCal Approach, United three-eighty-nine.”
“United three-eighty-nine, NorCal, go.”
“NorCal, United three-eighty-nine, we’d like to go ahead and…”

My headphones went silent, Channel 9 shut off.

I didn’t think too much of it as we continued our descent, flight attendants walking calmly through the cabin, getting us ready for landing. I had noticed our arrival path was one I was unfamiliar with, but nothing else seemed out of the ordinary… until we turned onto the final approach. In the turn, I noticed the unmistakable glint of firetrucks’ rotating red lights, lined up alongside the runway.

So far in this series of posts, we’ve talked about sustainable, useful process; we’ve talked about talking; and we’ve talked about leveraging those processes and communication strategies to set expectations and improve the chances of reasonable outcomes. But sometimes, no matter how much your team tries, things don’t quite turn out as planned… and then it’s time for the last “-ation” of aviation: remediation.

Failure is Here to Stay

One of the emerging trends in the DevOps space is the idea that failure is inevitable. Software, with its reliance on hardware (which, these days, is itself made up of layers of microcode), third-party libraries, and the interplay between our product’s own components, certainly qualifies as complex. Add the additional hardware and software required to operate at “web-scale,” along with the sociological dynamics of development, operations, and management teams, and you have the textbook complex system.

Complex systems, like their simpler siblings, can obviously experience various sorts of failure. What makes them interesting is that, unlike with simpler systems, it’s not a matter of whether they will experience failures, but a matter of when and to what degree. Sociologist Charles Perrow’s seminal work on the topic introduced the term “normal accidents” (in his book of the same name) to drive home the idea that in these sorts of systems, accidents aren’t aberrations, but rather the order of the day.

Failure is a Feature; Remediation is the Bug

If we accept that unanticipated failures will occur in the complex systems we develop and operate, it can be easy to throw up our hands and say “Well, let’s ignore it; we’ll tackle it with an all-hands-on-deck response when it happens.” And often, this translates to everyone keeping in the back of their mind that they should expect to be on call all the time. But there are some good lessons we can take from aviation to prepare for these harrowing times that don’t involve giving up on quality and keeping your engineers and ops staff on edge.

Incorporate failure responses into normal flows

A required part of each instrument approach in aviation is the “missed approach segment.” This effectively outlines what to do if Plan A–landing–doesn’t work out. It describes where to go and, as we discussed last week, creates expectations for pilots and controllers should they lose two-way, so-called “active” contact. Missed approach procedures also provide fast, short-form communication: “Execute the ILS-28-left missed approach” is significantly faster to say than “Climb to 600 feet; climbing right turn to 3,000 feet on a 285 heading, to join the SFO 280 radial; hold, as published, at OLYMM intersection,” which is the equivalent statement.

Quick, exacting communication is great, but there’s also a cultural benefit: pilots are taught to assume that every landing will end in an aborted approach (what’s called a go-around in aviation). When you acculturate system operators to expect the unexpected and incorporate “plan B” into regular processes, failure is de-fanged and the initial response to commonplace error states is not panic, but familiarity and the confidence that comes with knowing what to do next.
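To make the “missed approach” idea concrete in software terms, here is a minimal sketch in which an operation cannot be defined without its plan B. Everything in it is hypothetical and for illustration only: the `Procedure` class, the `fetch_live_prices` and `fetch_cached_prices` functions, and the “get-prices” name are assumptions, not part of any real system or library.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class Procedure(Generic[T]):
    """A hypothetical operation published together with its 'missed approach'."""

    name: str
    plan_a: Callable[[], T]
    missed_approach: Callable[[], T]  # required field: no procedure without a plan B

    def execute(self) -> T:
        try:
            return self.plan_a()
        except Exception as exc:
            # The fallback is part of the normal flow, not an improvised response.
            print(f"{self.name}: plan A failed ({exc}); executing missed approach")
            return self.missed_approach()


def fetch_live_prices() -> dict:
    # Hypothetical primary path; it fails here so the fallback is exercised.
    raise TimeoutError("pricing service did not respond")


def fetch_cached_prices() -> dict:
    # Hypothetical fallback: serve the last known-good values.
    return {"widget": 9.99, "source": "cache"}


prices = Procedure("get-prices", fetch_live_prices, fetch_cached_prices)
result = prices.execute()
```

Because the fallback is a required field rather than an optional one, whoever writes the procedure has to decide what the go-around looks like before the first failure, and naming it gives the team the same kind of shorthand a published missed approach gives pilots.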

Loose coupling is difficult, but preferable

One of the allures of service-oriented architectures is the inherent design principle of loosely-coupled components. Done properly–with sane versioning, managed SLAs, and the required coordination to string the components together into a useful product–they’re not cheaper than tightly-coupled systems, but they have proven to be more resilient to the types of unexpected changes and unknown dependencies that induce the failure we encounter in our industry. Perrow describes the dichotomy thusly: “Loosely coupled systems … can incorporate shocks and failures and pressures for change without destabilization. Tightly coupled systems will respond more quickly to these perturbations, but the response may be disastrous.”

Emergencies get priority

In aviation, when a pilot declares an emergency, an alternate world of procedures goes into effect. It’s fascinating to watch aviation’s emergency response, especially from a system-wide perspective. Another often-touted feature of the service-oriented model is that it allows teams to deploy on their own schedule, without being gated by anyone. But should everyone have the freedom to ship when the service is down or degraded, due to a yet-to-be-determined problem? Probably not.

In urgent conditions, even small aircraft can push the big boys out of the way. (I have actually experienced this once while flying; sorry, Air Canada!) It’s more than common courtesy: it is an acknowledgment that the nature of complex systems is such that small, non-critical components can indeed affect big ones, and sometimes, for the benefit of the system, we must gate operations to reduce complexity while in the midst of a failure condition.
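One way such gating might look in a deployment pipeline is sketched below. It assumes a hypothetical `DeployGate` object that every team’s pipeline consults before shipping, and made-up incident IDs such as “INC-1”; it is an illustration of the idea, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DeployGate:
    """Hypothetical gate: holds routine deploys while any incident is open."""

    open_incidents: Set[str] = field(default_factory=set)

    def declare_emergency(self, incident_id: str) -> None:
        self.open_incidents.add(incident_id)

    def resolve(self, incident_id: str) -> None:
        # discard() does not raise if the incident was already resolved.
        self.open_incidents.discard(incident_id)

    def may_deploy(self, fixes_incident: Optional[str] = None) -> bool:
        # Normal operations: teams ship on their own schedule.
        if not self.open_incidents:
            return True
        # During an emergency, only deploys tied to an open incident go out.
        return fixes_incident in self.open_incidents


gate = DeployGate()
before = gate.may_deploy()                          # no incident: allowed
gate.declare_emergency("INC-1")
routine = gate.may_deploy()                         # routine deploy: held
remedy = gate.may_deploy(fixes_incident="INC-1")    # fix for the incident: allowed
gate.resolve("INC-1")
after = gate.may_deploy()                           # back to normal
```

The gate deliberately knows nothing about which service is deploying: while the cause is still undetermined, any change is a potential source of added complexity, so the only deploys given priority are the ones aimed at the emergency itself.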

For induced failures, swap out operators

If an engineer has induced a failure within the system, they should, if at all possible, communicate their understanding of the situation and transfer the responsibility of dealing with the incident to someone else. Oftentimes, when errors are induced by operators of complex systems, an effect Perrow calls “incomprehensibility” occurs: the feeling that it’s inconceivable that their actions could have caused the results they’re seeing. Dealing with the incomprehensibility of the situation involves a tremendous cognitive load, so much so that the person responsible for dealing with the situation may distract themselves, increasing the chance that they’ll miss important information that could solve the problem, or worse, cause more errors that compound the failure.

Another reason to swap operators out: Harvard psychologists have started researching a class of human error called “counter-intentional error.” It’s the phenomenon we’ve all probably experienced where a friend tells us to be careful about something in particular, and we end up doing that exact thing, precisely because we’re so focused on not doing it. So when possible, we should relieve those involved in operational failures, not for punitive reasons, but because once in an error state, it’s difficult for our brains to break out of that mode while caught up in the moment. It’s also good advice for no other reason than that a fresh pair of eyes on the problem can be worth a lot.

Healthy postmortems are not a luxury

No-blame, actionable postmortems are becoming a core tenet of DevOps, and there are numerous places to find advice. Two interesting aspects aviation has uncovered: postmortems are not run by the agency responsible for governing air traffic (the Federal Aviation Administration); they’re run by the National Transportation Safety Board. The separation of responsibilities results in some interesting findings at times, which the FAA isn’t incentivized to discover, and arguably may not have ever found. It’s often hard to look in the mirror, so if your organization is big enough, having a separate team gather the postmortem information may be worth it. If not, even having someone from an uninvolved, but related team can be helpful.

Complex system failures, by definition, are never the result of a single cause when non-hostile actors are interacting with the system. So while postmortems can certainly focus on, say, “pilot error,” the real answer is always more complex. This is the “five whys” principle in action. Postmortems suffer from what I call the postmortem paradox: they must walk the fine line of being taken seriously enough to produce actionable results to improve organizational and system behavior, but must not be taken so seriously that they are just a witch hunt in disguise.

An Uneventful Incident

When I got home from the airport, I pulled up LiveATC to see if I could discover what happened. Turns out the next words out of the pilot’s mouth were, “We’d like to go ahead and declare an emergency”–our plane had suffered an engine shutdown. Since we were already in the descent, the pilot elected to complete the flight, but requested a wide berth from air traffic control, which included an expedited arrival path and “rolling the equipment”–sending out the emergency vehicles. Listening to the back-and-forth reveals a mastery of incident response and (possible) disaster management. There’s a subtextual beauty to the exchange, especially when you consider that the majority of the passengers were none the wiser. (And as it played itself out, nor did they need be.)

It’s certainly the case that not all normal accidents will turn out this way; but since they’re bound to occur no matter what we do, by copying some of the techniques aviation uses, your organization has a higher probability of keeping an incident from blossoming into an accident, responding effectively to whatever your complex system throws at you, and incorporating the lessons learned for the next go-around.
