What Is the Risk That Amazon Will Go Down (Again)?

Velocity 2013 Speaker Series

Why should we bother at all with notions such as risk and safety in web operations? Do web operations face risk? Do they manage risk? Do they produce risk? Last Christmas Eve, an AWS outage at Amazon affected a variety of actors, including Netflix, a service included in many of the gifts shared on that very day. The event introduced the notion of risk into the discourse of web operations, which makes this a good time for some reflective thoughts on the very nature of risk in this domain.

What is risk? The question is a classic one, and the answer is tightly coupled to how one views the nature of the incident occurring as a result of the risk.

One approach to assessing the risk of Amazon going down is probabilistic: start by laying out the entire space of potential scenarios leading to Amazon going down, estimate the probability of each, and multiply each scenario's probability by its estimated severity (likely expressed as the cost of that specific scenario, which depends on when the event occurs). The scenarios can then be plotted in a risk matrix showing their weighted ranking (to prioritize future risk mitigation measures), or their risks can be summed into a collective total (to judge whether the risk of Amazon going down is below a certain acceptance criterion).
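The arithmetic behind this approach can be sketched in a few lines of Python. This is only an illustration: the scenario names, probabilities, costs, and the acceptance criterion are invented for the example, and a real assessment would first have to justify every one of those numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float  # assumed chance of the scenario occurring in a year
    severity: float     # assumed cost in dollars if the scenario occurs

    @property
    def risk(self) -> float:
        """Probability-weighted cost of this scenario."""
        return self.probability * self.severity


def rank_by_risk(scenarios):
    """Order scenarios from highest to lowest risk, as a risk matrix would."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


def total_risk(scenarios):
    """Collective sum of the risks of all scenarios."""
    return sum(s.risk for s in scenarios)


def is_acceptable(scenarios, acceptance_criterion):
    """True when the summed risk stays below the acceptance criterion."""
    return total_risk(scenarios) < acceptance_criterion


# Hypothetical scenarios; none of these figures come from Amazon.
SCENARIOS = [
    Scenario("data center power loss", 0.02, 5_000_000),
    Scenario("load balancer misconfiguration", 0.05, 1_000_000),
    Scenario("regional network partition", 0.01, 20_000_000),
]

if __name__ == "__main__":
    for scenario in rank_by_risk(SCENARIOS):
        print(f"{scenario.name}: {scenario.risk:,.0f}")
    print("acceptable:", is_acceptable(SCENARIOS, acceptance_criterion=500_000))
```

Note that the whole calculation rests on the list of scenarios being complete and on the probabilities being knowable, which is exactly what the rest of this post calls into question.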

This first way of answering the question of what the risk is for Amazon to go down is intimately linked with a perception of risk as energy to be kept contained (Haddon, 1980). This view grew out of the more recent expansion of the process industries, in which clearly graspable energies (the fuel rods at nuclear plants, the fossil fuels at refineries, the kinetic energy of an aircraft) are to be kept contained and safely separated from a vulnerable target such as human beings. The next question of importance then becomes how to avoid an uncontrolled release of the contained energy. There are basically two strategies for mitigating that risk: barriers and redundancy (and the two combined: redundancy of barriers). Physically graspable energies can be contained through the use of multiple barriers (called “defense in depth”) and potentially several barriers of the same kind (redundancy), for instance several emergency-cooling systems for a nuclear plant.
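The appeal of redundancy is easy to see in numbers. The sketch below assumes that barriers fail independently of one another, a simplification that is not stated in the sources cited here and that is introduced only to show why stacking barriers looks so attractive on paper. The failure probabilities are invented.

```python
import math


def probability_all_barriers_fail(failure_probabilities):
    """Chance that every barrier fails at once.

    Assumes the barriers fail independently, so the individual
    failure probabilities can simply be multiplied together.
    """
    return math.prod(failure_probabilities)


if __name__ == "__main__":
    # Hypothetical: one, two, then three cooling systems, each assumed
    # to fail on demand one time in a hundred.
    for count in (1, 2, 3):
        print(count, probability_all_barriers_fail([0.01] * count))
```

Under the independence assumption, each added barrier divides the chance of an uncontrolled release by a hundred. If the barriers share a weakness, the multiplication no longer holds.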

Using this metaphor, the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.

Controlling risk by analyzing the complete space of possible (and graspable) scenarios basically does not distinguish between safety and reliability. From this view, a system is safe when it is reliable, and the reliability of each barrier can be calculated. However, there is one system component whose reliability is harder to grasp than any other: the human. Inevitably, proponents of the energy/barrier model of risk end up explaining incidents (typically accidents) in terms of unreliable human beings failing to guarantee the safety (reliability) of an inherently safe system (one whose risk is controlled by reliable barriers). This problem has an entire literature of its own and is too big to outline in further detail in this blog post, but let me point you towards a few references: Dekker, 2005; Dekker, 2006; Woods, Dekker, Cook, Johannesen & Sarter, 2009. The only issue is that these (and most other citations in this post) are all academic tomes, so for those who would prefer a shorter summary available online, I can refer you to this report. I can also reassure you that I will get back to this issue in my keynote speech at the Velocity conference next month. To put the critique briefly: the contemporary literature questions the view of humans as the unreliable component of inherently safe systems, and instead advocates a view of humans as the only ones guaranteeing safety in inherently complex and risky environments.

The idea that such environments are inherently risky reveals the second problem of the energy/barrier approach. The most-cited critic of the view that risk is best characterized (and calculated) as containable energy is Charles Perrow, who introduced his theory of Normal Accidents in the aftermath of the Three Mile Island meltdown (Perrow, 1984). In Perrow’s view, risk is also a structural property of system complexity, and that complexity increases with the introduction of barriers. Perrow argued (and still does) that more barriers introduce more possibilities for unanticipated interactions within the system, thereby increasing the potential for major accidents. This holds especially when the barriers are tightly coupled in space and function, as would be the case with backup hydraulic systems of an aircraft that are so closely spaced that one explosion might take them all out. The risk mitigation measures of one view on risk become a source of risk in another. With risk as a structural property of system complexity, what now is the risk of Amazon going down?

But we are not done yet. Thus far, the risk of Amazon going down has mainly been discussed in terms of relatively static processes like containing energy or reducing structural complexity. What if risk is not located in the structural properties of the system, but is a much more dynamic phenomenon located in social and organizational processes? Several sociologists have characterized risk as a path-dependent organizational process. Notably, Turner and Pidgeon introduced a theory of risk based on an organizational failure of intelligence over an “incubation period” leading up to the ultimate consequence: the man-made disaster (Turner & Pidgeon, 1997). For more on this, see also Vaughan’s seminal theory (based on her nine years of studying the explosion of the Challenger space shuttle) of how risk gets normalized in a culture of production (Vaughan, 1996), and Snook’s theory of the “practical drift” of organizational processes leading up to accidents, such as the friendly-fire accident in northern Iraq that he studied (Snook, 2000).

So based on all this, exactly where is the risk that Amazon goes down? In the energy? In the human unreliability? In the structural complexity of the web and the billions of actors connected to it? In the social and organizational processes inside Amazon? Who gets to say? Who do we give the power to construct the risk of Amazon going down? This is where Slovic comes in as my final reference. He does not care which way of framing risk is “true” (a rather meaningless concept in a post-modern society anyhow), but concludes that different actors will advocate different views and that risk then needs to be analyzed in terms of the game in which these actors engage (Slovic, 2001). As such, we may have asked the wrong questions to begin with.

The revised questions I will discuss in my keynote include: If we view the risk of Amazon going down as a game to be played, how do we want that game to be played? Which players do we want? Do we even agree on the rules of the game? This is where the notion of risk is turned into an issue of relations, interactions, and power.

References

Dekker, S. W. A. (2005). Ten questions about human error: A new view of human factors and system safety. Mahwah, NJ: Lawrence Erlbaum Associates.

Dekker, S. W. A. (2006). The field guide to understanding human error. Aldershot, UK: Ashgate Publishing Co.

Haddon, W. (1980). The basic strategies for reducing damage from hazards of all kinds. Hazard Prevention, 16, 8-12.

Perrow, C. (1984). Normal accidents: Living with high-risk technologies. New York: Basic Books.

Slovic, P. (2001). The risk game. Reliability Engineering & System Safety, 59(1), 73-77.

Snook, S. A. (2000). Friendly fire: The accidental shootdown of U.S. Black Hawks over northern Iraq. Princeton: Princeton University Press.

Turner, B. A., & Pidgeon, N. F. (1997). Man-made disasters. Boston: Butterworth-Heinemann.

Vaughan, D. (1996). The Challenger launch decision. Chicago: The University of Chicago Press.

Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. B. (2009). Behind human error. Aldershot, UK: Ashgate Publishing Co.

This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.
