Failure Happens: A summary of the power outage at 365 Main

Datacenter provider 365 Main released their initial report from Tuesday’s power failure which affected Craigslist, Technorati, Yelp, TypePad, LiveJournal, Vox, and others. This outage is an excellent example of complex systems failure, and so I’ll be using it as the basis for my next few posts on Operations. This is my own analysis using publicly available data.

The 365 Main facility does not have a typical battery backup system. Instead it relies on Continuous Power Supplies (CPS), which use a flywheel-driven alternator to generate electricity.

The flywheel is connected to both a large diesel engine and an electric motor which runs on utility power. The flywheel is normally turned by the electric motor, and stores enough kinetic energy to power the alternator for up to 15 seconds. When utility power fails, the diesel engine is supposed to start in under 5 seconds, well before the flywheel’s kinetic energy is exhausted, providing uninterrupted electrical power.
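The failover window can be sketched with the timing figures above. This is a minimal illustration, not 365 Main’s actual control logic; the constants and function name are made up:

```python
# Illustrative sketch (not 365 Main's actual control logic): checks whether
# a diesel start time leaves margin before the flywheel's ride-through ends.
FLYWHEEL_RIDE_THROUGH_S = 15.0  # flywheel can power the alternator this long
DIESEL_START_DEADLINE_S = 5.0   # diesel is supposed to be running by now

def power_is_continuous(diesel_start_time_s: float) -> bool:
    """True if the diesel starts before the flywheel's energy is exhausted."""
    return diesel_start_time_s <= FLYWHEEL_RIDE_THROUGH_S

# Normal case: the diesel starts on time, well inside the ride-through window.
assert power_is_continuous(4.2)
# Failure case: a 20-second start attempt outlasts the flywheel.
assert not power_is_continuous(20.0)
```

The 10-second gap between the 5-second start deadline and the 15-second ride-through is the design margin; a generator that fails to start at all, as happened here, blows through both.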

The advantage of a CPS over a battery-based system is that the power going to the datacenter is decoupled from the utility power. This eliminates the complex electrical switching required by most battery-based systems, making many CPS systems simpler and sometimes more reliable.

In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, overloading the surviving backup system and causing power failures in 3 of 8 customer areas.
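The cascade can be illustrated with a toy model. The actual switchgear topology and capacities are not public, so the structure and numbers below are made up for illustration:

```python
# Toy model (the actual switchgear design is not public): each customer
# area has a primary generator; areas whose primary fails are switched to
# a shared backup that can carry only BACKUP_CAPACITY areas' worth of load.
# If more areas transfer than it can carry, the backup trips and every
# transferred area goes dark at once.
BACKUP_CAPACITY = 1

def dark_areas(num_areas, failed_primaries):
    """Return the list of customer areas that lose power."""
    transferred = [a for a in range(num_areas) if a in failed_primaries]
    if len(transferred) <= BACKUP_CAPACITY:
        return []            # backup absorbs the load; no customer impact
    return transferred       # overload trips the backup: all transferred lost

# One failure: the backup holds and no areas go down.
assert dark_areas(8, {3}) == []
# Several near-simultaneous failures overload the backup; all go down together.
assert dark_areas(8, {1, 3, 6}) == [1, 3, 6]
```

The point of the sketch is the discontinuity: below the backup’s capacity the redundancy hides every fault, and one fault past it the shared backup turns several independent failures into a single correlated outage.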

[Diagram: 365 Main power systems]

What’s most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have brought down only one area instead of three. This kind of cascade failure is common in complex, tightly coupled systems. In my experience, these sorts of failure modes are often identified and then promptly dismissed as being “nearly impossible”.

Unfortunately, the impossible often becomes reality.

To put it another way… Failure Happens.

Next week we’ll dive into building resilient websites and take a look at a few of the sites that went down. Artur and I are both excited to be writing about this, and welcome your comments, suggestions, and war stories!



  • http://www.markmcspadden.net Mark

    Jesse,

    Great breakdown and overview. As a developer, I’m very interested in the follow-up article and hope to understand what some of these sites could have done (and plan to do in the future) to avoid such an incident.

    I’d also be interested in knowing if any of these companies had Incident Response Plans in place and if so, how they worked. (I seriously doubt Twitter was in the Six Apart IRP, which probably shows a lack of planning altogether.)

    Can’t wait for you and Artur to enlighten us!

  • steve

    As someone who builds large data centres, I find this analysis is going in a disturbing direction. You’re claiming that “failure happens” – which is true for any single location.

    What is not acceptable is the lack of multi-site redundancy, and there should not have been any outage. Even if 365’s primary data centre had been nuked, there should not have been any outage.

    Please do not miss the forest for the trees, turning your posts into an overly detailed analysis of what happened as a way of excusing why it happened. Multi-site redundancy is a requirement, and there was no multi-site redundancy.

    365 Main failed to operate their data centre in a competent manner.

  • Future

    Next step: Distributed (peer-to-peer) dynamic web pages ?

  • http://radar.oreilly.com/jesse Jesse Robbins

    Steve, you make a good point. In this case, 365 Main is the datacenter that failed, taking out a number of sites for which it was a single point of failure.

    I wanted to begin here because we’re trying to change the common myth among the inexperienced that “big expensive datacenters don’t fail”.

  • http://kitchensoap.com John Allspaw

    Excellent post.

    SPOF reduction is a process, not just a one-time exercise. :)

  • Thomas Lord

    It’s a design failure. There is absolutely no excuse for switching circuitry of that tiny scale to be designed without regard to whether or not it will overload one of the generators. The “site redundancy” issue is separate and, honestly, more nuanced than Steve suggests. This case sounds like just, straight up, bad circuit design.

    -t

  • steve

    Hi Thomas,

    There are multiple points of potential failure, and any of them can be individually simple. Even if it is just a bad circuit design, the fact that it can (and did) go wrong is the reason why the overall approach needs to compensate for those issues.

    Jesse has the right idea here… changing the common myth that big datacentres don’t fail. They do fail, and they will always fail. Even if you ran a datacentre and made sure that the circuit design was impeccable, there’s always something else that will go wrong.

    When (not if) your data centre goes down, there will be someone on a blog somewhere posting that it was a simple error in Widget X because of Factor Y. All causes are simple in hindsight, and the procedures and solutions need to cater for some degree of fallibility at the lower levels.

    Multi-site redundancy is a way of saying “I’ve done the best I can, but failure happens. So bung another one in”. After all, even the best circuit design is going to have a hard time coping with a nuclear blast.

  • Thomas Lord

    I see your point. You could also put it this way: even if we are so slick and gifted as engineers that we make boneheaded mistakes only 0.001% of the time, if the system has even as few as 100 engineering decisions behind it, we won’t last a year without seeing a failure. Redundancies and fail-over plans give us a little help though: if I triple my provisioning in the form of a perfect redundancy/fail-over circuit, now I’ll see 3 failures a year, not just 1, but in theory the three have to be simultaneous to actually take out the service.
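    That trade-off can be sketched with made-up numbers (the rates below are illustrative, not measured):

```python
# Back-of-envelope sketch with made-up numbers: more redundant units means
# more individual faults observed, but far fewer service-losing events.
p_fail = 0.001            # chance any one unit fails during an incident
n_units = 3               # triple redundancy

# Probability that at least one of the three units fails:
p_any = 1 - (1 - p_fail) ** n_units

# Probability that all three fail at once, which is what actually loses
# service (assuming independence, which cascades like this one violate):
p_all = p_fail ** n_units

assert p_any > p_fail     # redundancy means you *see* more faults...
assert p_all < p_fail     # ...but lose service far less often
```

    The independence assumption is the catch: a shared latent defect, like the one that hit several of 365 Main’s generators, makes the “simultaneous” case far more likely than `p_all` suggests.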

    That way of buffering risk with redundancy is a very important design principle. It works all up and down the fractal scale and across the horizontal axes. You have redundant generators at one scale, redundant data centers at a bigger scale, and redundant bits in your RAM and on disk at a smaller scale. And you want a kind of orthogonal redundancy horizontally: say, redundant generators in one geographic spot, but also redundant geographic spots.

    If I understand the matter of this case: the problem went from “nasty” to “embarrassingly bad” exactly when (at one particular fractal scale) their implementation of fail-over did something boneheaded and, instead of tossing some customers overboard directly, attempted the impossible feat of serving them all from a single generator and thus tossed them all overboard accidentally.

    The same kind of mistake easily occurs with site redundancy, too.

    And redundancy at any scale is very expensive. So there’s just a very, very hard budget game there, trying to guesstimate what will stand.

    Single-point-of-failure analysis is not enough these days, anyway. Nature itself, along with a helping hand from all the popular kids in the murdering nihilist crowd, is these days practicing up on “swarm attacks”, where complex networks are dinged with large numbers of “small insults”, closely clustered in time, scattered all over the network. So, a hurricane there, a cable cut here, perhaps a crescent wrench falls onto a transformer somewhere else…. This does two things: First, a swarm of “small insults” doesn’t cost much to create but costs a lot to recover from. Second, it’s like a “shake test” for all of those beautifully conceived systems of redundancy and fail-over: it’s a good way to start just dialing up “unanticipated inputs” to lots of nodes on the network at once, so if you want to create a cascade that takes advantage of all those latent bugs in the implementation of fail-over, well, there’s your strategy.

    I think we need a more holistic, more grounded concept of redundancy. For example, if you add this new thing to my life — Google — well, it’s big and complicated and very, very useful, right? My first question is not about how robustly we guesstimate they’ve built it out. That’s not my first question because, before it arrived, I didn’t depend on Google in any way. So, my first question is: how can I build a workable substitute for this thing, but without all the huge expense and complexity? It’s the end-to-end redundancy that matters — internal redundancies in the network don’t matter if they fail to afford that end-to-end redundancy.

    We have trouble thinking that way in the market though. The big, complex, monolithic services are hairy but, paradoxically, they are the easiest kinds of things to build if you’re planning to spend a lot of money anyway. Those centralized business models poison their product with risk because they can translate that risk into marginal profit very efficiently — this is the main source of the big returns on centralized web services.

    The remedy requires two things. The situation calls for some creativity in the financial markets, to make large sums easier to administer even if the aim is “many small investments” instead of “a few big investments”. And it calls for some new design patterns, techniques, standard procedures, standard analyses, etc. for engineers, because decentralized services are different in function, not just form: we have to learn how to create value competitively with the centralized services while using decentralized constraints that mean we can’t exactly duplicate the centralized functionality.

    Maybe someday most American towns and cities will own or lease some nearby hyper-power-efficient data store and server farm, toss on some open source software, join a data-trading consortium to share crawling results with others (sorta like the old UUNET), and 10,000 little Googles will bloom, perhaps as a service of public libraries.

    The failover circuit for that arrangement? The user picks a different provider and sticks that URL in the search box. A lot less error prone.

    -t

  • jose

    I’d add my interest in the response plan. Does 365 Main have one? It would be interesting to see an article on how they tried to resolve this situation.

  • http://www.cutcaster.blogspot.com/ john

    It failed, and it didn’t look like 365 had a great response plan, but at least they will learn from this letdown.

  • http://www.wardboland.com JPeter Ward

    Companies that rely entirely on a flywheel (30 seconds of backup or so) are somewhat short-sighted.

    I believe in flywheels when used in parallel with a string of batteries. It gives the generator an extra margin (a what-if) in case it does not fire on time.

  • http://kitchensoap.com John Allspaw

    “And, redundancy at any scale is very expensive. So, there’s just a very, very hard budget game there, trying guesstimate what will stand.”

    This, IMHO, is the real question. These sites have lots of very smart people working there who know the risks of not running their sites from multiple datacenters, and I suspect they chose not to for economic reasons.

    Saying something like: “OMGWTF! They should serve craigslist.com from multiple places so this wouldn’t happen!” is a little like saying the city of Denver should own a few snowplows in case there’s a blizzard.

  • steve

    John; amusing, but the fact is that multi-site redundancy is such a basic requirement that it’s hard to pass even something as simple as ISO9001 if you don’t have it.

    Yes, the economic reason involved is that the hosting company has a higher profit margin if they can convince all their users to rely on a single point of failure. I am quite frankly amazed that this was possible.

    That is one of my problems with the idea of hosting applications with the providers, as the clients don’t have the ability to audit the facilities in the same way that they would audit their own facilities.

    All elements of their operation have to be brought in-house to ensure basic checklist items like backups, redundancy, and the more detailed aspects of specific accreditations.

    I predict it won’t be long before we see something like 365 Main with dysfunctional backups. In fact, this has happened to a lesser degree with lost emails and files from various providers like Yahoo and Google.

  • Hendrik

    About 10 years ago, while I was studying Formal Computer Science and its associated logics, a lot of work was being done on the analysis of concurrent systems and on extending that work to multi-point failure analysis in order to improve the integrity of these systems. (If memory serves, this was being used by the European Space Program, and Prof. Holger Schlingloff was one of the main researchers.)

    Do you know whether this kind of work has been applied to Data Centres and the similar problems of concurrent data storage?

  • John Allspaw

    steve: or, the economic reason could be that the sites saw that being exposed to a potential outage for X hours (and the associated revenue loss) is worth it compared to the infrastructure and engineering costs of deploying a second (or 3rd, or 4th, etc.) datacenter.

  • Jason

    The application architects should have designed for a single-datacenter failure. Yes, this should have been avoided at 365, but where was a 2nd or 3rd application site to pick up the user load for these Web 2.0 companies? They had a failure here too.

  • franklin

    It seems a specific software bug caused multiple generators to fail to start, rather than independent failures. Thus the lesson learned here might be “test, test, and test again.” However, the poster’s point remains valid: related or independent failures, it DOES happen. A general overview of the bug follows (copied from the 365 site):

    ====

    The team discovered a setting in the DDEC that was not allowing the component to correctly reset its memory. Erroneous data left in the DDEC’s memory subsequently caused misfiring or engine start failures when the generators were called on to start during the power outage on July 24.

    ====

  • JF CHRISTIN

    As an expert in resilient electrical systems, I am amazed by the architecture implemented in this datacenter. There should be independent power paths to the applications, to avoid this well-known domino effect when redundancy is lost. I am also always surprised that design offices are reluctant to go through a complete and comprehensive FMEA / reliability analysis: it costs somewhere between 35 and 60 k euros for this type of installation. Corporations like Schneider/SQD have decades of experience and can put a lot of events into the analysis. Going through this type of exercise will not prevent the failures from happening, but their consequences will be far less damaging! Jesse, where did you get this “public” information?

  • http://radar.oreilly.com/jesse Jesse Robbins

    From the 365 Main website.