Everything is distributed

How do we manage systems that are too large to understand, too complex to control, and that fail in unpredictable ways?

Complexity

“What is surprising is not that there are so many accidents. It is that there are so few. The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all.”—Richard Cook

In September 2007, Jean Bookout, 76, was driving her Toyota Camry down an unfamiliar road in Oklahoma, with her friend Barbara Schwarz in the passenger seat. Suddenly, the Camry began to accelerate on its own. Bookout hit the brakes and applied the emergency brake, but the car continued to accelerate. It eventually collided with an embankment, injuring Bookout and killing Schwarz. In the subsequent legal case, lawyers for Toyota pointed to the most common culprit in these types of accidents: human error. “Sometimes people make mistakes while driving their cars,” one of the lawyers claimed. Bookout was older, the road was unfamiliar, these tragic things happen.

However, the recently concluded product liability case against Toyota turned up a very different cause: a stack overflow error in Toyota’s software for the Camry. This is noteworthy for two reasons: first, the oft-cited culprit in such accidents, human error, proved not to be the cause (a problematic premise in its own right); and second, it demonstrates how we have definitively crossed a threshold: software failures no longer merely cause minor annoyances or (potentially large) corporate revenue losses, they now reach into the realm of human safety.

It might be tempting to dismiss this case as minor: a fairly vanilla software bug that (so far) appears to be contained to a specific car model. But the extrapolation is far more interesting. Consider the self-driving car, development of which is already well underway. Remove the purported culprit in so many accidents, human error, and the premise is that a self-driving car is, in many respects, safer than a traditional one. But what happens when a failure completely outside the car’s control occurs? What if the data feed that helps the car recognize stop lights fails? What if Google Maps tells it to do something stupid that turns out to be dangerous?

We have reached a point in software development where we can no longer understand, see, or control all the component parts, both technical and social/organizational—they are increasingly complex and distributed. The business of software itself has become a distributed, complex system. How do we develop and manage systems that are too large to understand, too complex to control, and that fail in unpredictable ways?

Embracing failure

Distributed systems were once the territory of computer science Ph.D.s and software architects tucked off in a corner somewhere. That’s no longer the case. Just because you write code on a laptop and don’t have to think about message passing and locking doesn’t mean you don’t have to worry about distributed systems. How many API calls to external services are you making? Is your code going to end up on desktop sites and mobile devices, and do you even know all the possible devices? What do you know now about the network constraints that may be present when your app actually runs? Do you know what your bottlenecks will be at a given level of scale?

One thing we know from classic distributed computing theory is that distributed systems fail more often, and those failures tend to be partial in nature. Such failures are not just harder to diagnose and predict; they are often not reproducible: a given third-party data feed goes down, or you get screwed by a router in a town you’ve never even heard of. You’re always fighting the intermittent failure. So is this a losing battle?
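Working code has to treat those intermittent failures as the normal case rather than the exception. A minimal sketch in Python: `flaky_service` is an invented stand-in for any unreliable dependency, and the retry counts and delays are illustrative, not recommendations.

```python
import random
import time

def flaky_service(failure_rate=0.3, rng=random):
    """Stand-in for an unreliable dependency (an API, a data feed, a router)."""
    if rng.random() < failure_rate:
        raise ConnectionError("upstream unavailable")
    return "ok"

def call_with_retries(fn, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up: the failure is no longer "intermittent"
            sleep(base_delay * 2 ** attempt)  # back off before retrying

print(call_with_retries(lambda: flaky_service(rng=random.Random(42))))  # → ok
```

The point is not the backoff arithmetic; it’s that the failure path is an explicit, first-class part of the code instead of something testing was expected to rule out.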

The solution to grappling with complex distributed systems is not simply more testing, or Agile processes. It’s not DevOps, or continuous delivery. No single thing or approach could prevent something like the Toyota incident from happening again. In fact, it’s almost a given that something like it will happen again. The answer is to embrace that failures of an unthinkable variety are possible, a vast sea of unknown unknowns, and to change how we think about the systems we are building, not to mention the systems within which we already operate.

Think globally, develop locally

Okay, so anyone who writes or deploys software needs to think more like a distributed systems engineer. But what does that even mean? In practice, it boils down to moving past a single-computer mode of thinking. Until very recently, we could rely on a computer being a relatively deterministic thing: you write code that runs on one machine, and you can make assumptions about, say, how long a memory lookup takes. But nothing really runs on one computer any more; the cloud is the computer now. It’s akin to a living system, constantly changing, especially as companies move toward continuous delivery as the new normal.

So, you have to start by assuming the system in which your software runs will fail. Then you need hypotheses about why and how, and ways to collect data on those hypotheses. This isn’t just saying “we need more testing,” however. Traditional testing presumes you can delineate all the cases that require testing, which is fundamentally impossible in distributed systems. (That’s not to say testing isn’t important; it just isn’t a panacea.) When you’re in a distributed environment and most of the failure modes are things you can’t predict in advance or test for, monitoring is the only way to understand your application’s behavior.
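A sketch of what collecting that data looks like in practice: instrument every call to a dependency so that successes, failures, and latencies are recorded as they happen. The `Metrics` class and the names below are invented for illustration; a real system would ship these numbers to a monitoring backend rather than hold them in process.

```python
import time
from collections import Counter

class Metrics:
    """Toy in-process metrics: counters plus raw latency samples."""
    def __init__(self):
        self.counts = Counter()
        self.latencies = []  # keep raw samples so histograms stay possible

    def observe(self, name, fn):
        """Run fn(), recording its outcome and duration under `name`."""
        start = time.perf_counter()
        try:
            result = fn()
            self.counts[name + ".success"] += 1
            return result
        except Exception:
            self.counts[name + ".failure"] += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

metrics = Metrics()
metrics.observe("geo_lookup", lambda: 41 + 1)  # stand-in for real work
print(metrics.counts["geo_lookup.success"])    # → 1
```

Because every call is observed, the hypotheses about how the system fails can be checked against live behavior instead of against the cases someone thought to test.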

Data are the lingua franca of distributed systems

If we take the living-organism-as-complex-system metaphor a bit further: it’s one thing to diagnose what caused a stroke after the fact, and quite another to catch one as it starts to happen. Sure, you can look at the data retrospectively and see that the signs were there, but what you want is an early warning system, a way to see the failure as it’s starting and intervene as quickly as possible. Digging through averaged historical time series data only tells you what went wrong, that one time. And in dealing with distributed systems, you’ve got plenty more to worry about than pinging a server to see if it’s up. There has been an explosion in tools and technologies around measurement and monitoring, and I’ll avoid getting into the weeds on that here. What matters is twofold: histograms are generally preferable to averages when it comes to looking at your application and system data, and developers can no longer treat monitoring as purely the domain of the embattled system administrator.
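To see why histograms and percentiles beat averages, consider a made-up set of response times in which most requests are fast but a few are pathologically slow. The numbers are invented purely for illustration:

```python
import statistics

# Hypothetical response times in milliseconds: 95 fast requests, 5 very slow.
latencies = [20] * 95 + [2000] * 5

mean = statistics.fmean(latencies)
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(mean)  # 119.0  -- "average latency is ~120 ms" sounds fine
print(p50)   # 20.0   -- the typical request really is fast
print(p99)   # 2000.0 -- the slow tail the average smears away
```

The average here is a latency no actual request experienced; the distribution is where the story is.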

Humans in the machine

There are no complex software systems without people. Any discussion of distributed systems and managing complexity ultimately must acknowledge the roles people play in the systems we design and run. Humans are an integral part of the complex systems we create, and we are largely responsible for both their variability and their resilience (or lack thereof). As designers, builders, and operators of complex systems, we are influenced by a risk-averse culture, whether we know it or not. In trying to avoid failures (in processes, products, or large systems), we have leaned primarily toward exhaustive requirements and tight couplings in order to have “control,” but this often produces brittle systems that are in fact more prone to break or fail.

And when they do fail, we seek blame. We ruthlessly hunt down the so-called “cause” of the failure—a process that is often, in reality, more about assuaging psychological guilt and unease than uncovering why things really happened the way they did and avoiding the same outcome in the future. Such activities typically result in more controls, engendering increased brittleness in the system. The reality is that most large failures are the result of a string of micro-failures leading up to the final event. There is no root cause. We’d do better to stop looking for one, but trying to do so is fighting a steep uphill battle against cultural expectations and strong, deeply ingrained psychological instincts.

The processes and methodologies that worked adequately in the ’80s, but were already crumbling in the ’90s, have completely collapsed. We’re now exploring new territory, new models for building, deploying, and maintaining software—and, indeed, organizations themselves. We will continue to develop these topics in future Radar posts, and, of course, at our Velocity conferences in Santa Clara, Beijing, New York, and Barcelona.

Photo by Mark Skipper, used under a Creative Commons license.


  • http://ducknetweb.blogspot.com/ Medicalquack

    Complexities mean profits too, and it’s designed that way. I have a collection of videos created by people smarter than me, some quants and others. What better way to hear and get educated than from the folks who write them? I call it the Attack of the Killer Algorithms/Algo Duping page.

    http://www.ducknet.net/attack-of-the-killer-algorithms/

    I’m not that smart, but I used to write software myself and got fooled by my own proofs of concept too. Now, with these complexities, it’s worse. Flat out, people don’t work that way; hence the glut of complexity, as applying new layers to an already flawed model is not fixing things.

    http://ducknetweb.blogspot.com/2014/05/people-dont-work-that-way-world-of.html

  • mhaeberli

    Nice article – but, perhaps showing how hard it is to get software right (English is hard, too), it’s “emergency BRAKE”, not “emergency break”…

    • http://radar.oreilly.com/ Mac Slocum

      Thanks for the catch. The correction has been made.

  • Alex Tolley

    I don’t know what language Toyota coded the control system in, but I would hazard a guess from the description that it wasn’t a language that could have caught such an error at compile time, like Ada. Secondly, we already know how to reduce these errors: redundancy. As systems become more complex, we need to understand that they cannot be fully understood, and that we need to build in redundancy in a variety of ways to make them more resilient to failure.

  • http://www.xaprb.com/ Baron Schwartz

    I don’t disagree with any of the points made in this story, but I am not sure I would phrase the Toyota story this way: “it demonstrates how we have definitively crossed a threshold… human safety” Somehow the phrasing makes it sound like something new has occurred in this case. We crossed that threshold decades ago. If there’s an engineering curriculum that doesn’t feature case studies to that effect, I’d be surprised. See for example the classic http://en.wikipedia.org/wiki/Therac-25

  • bogd

    One thing I do not understand – you talk about a “stack overflow” as being the cause of the acceleration. And link to a 400+ page document as proof.

    Could you tell us where exactly in that document does it say that the acceleration was caused by a stack overflow?

    As far as I know, the cause was never identified, and this whole case was based on “well, their code was poorly written, and IN THEORY a stack overflow COULD have happened, and MIGHT have caused this” (reference: http://en.wikipedia.org/wiki/2009%E2%80%9311_Toyota_vehicle_recalls )

  • AnthroPunk

    What you are describing is the outcome of what we call PolySocial Reality (PoSR).

    PoSR describes the communications connections (human/human, human/machine, machine/machine) that are, or are not, being made as humans and machines shift from synchronous to asynchronous time. We describe the need for new ways of thinking about software development in our later papers; those most likely to be of value to Radar readers are:

    (Applin and Fischer 2012) PolySocial Reality: Prospects for Extending User Capabilities Beyond Mixed, Dual and Blended Reality:

    http://www.dfki.de/LAMDa/2012/accepted/PolySocial_Reality.pdf

    and

    (Applin and Fischer 2013) Thing Theory: Connecting Humans to Location Aware Smart Environments:

    http://www.dfki.de/LAMDa/2013/accepted/13_ApplinFischer.pdf

    and

    (Applin and Fischer 2012) Applied Agency: Resolving Multiplexed Communication in Automobiles (pp. 159-163):

    http://www.auto-ui.org/12/docs/AutomotiveUI-2012-Adjunct-Proceedings.pdf

    More papers at: http://posr.org/wiki/Publications

    See also: http://posr.org/wiki/Main_Page