Failure Happens: A summary of the power outage at 365 Main

Datacenter provider 365 Main has released their initial report on Tuesday’s power failure, which affected Craigslist, Technorati, Yelp, TypePad, LiveJournal, Vox, and others. This outage is an excellent example of complex systems failure, so I’ll be using it as the basis for my next few posts on Operations. This is my own analysis, based on publicly available data.

The 365 Main site does not have a typical battery backup system. Instead, they rely on Continuous Power Supplies (CPS), which use a flywheel-driven alternator to generate electricity.

The flywheel is connected to both a large diesel motor and an electric motor that runs on utility power. The flywheel is normally turned by the electric motor, and it stores enough kinetic energy to power the alternator for up to 15 seconds. When utility power fails, the diesel motor is supposed to start in under 5 seconds, well before the flywheel’s kinetic energy is exhausted, so that electrical power is never interrupted.
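To make the timing concrete, here is a minimal sketch of the margin the design depends on. The 15-second and 5-second figures come from the description above; the function name and the 20-second slow-start case are hypothetical, and treating "up to 15 seconds" as a fixed budget is a simplifying assumption.

```python
# Toy ride-through check, not a model of the actual 365 Main equipment.
FLYWHEEL_RIDE_THROUGH_S = 15.0  # best case: "up to 15 seconds" of stored energy
DIESEL_START_S = 5.0            # diesel is "supposed to start in under 5 seconds"


def ride_through_margin(diesel_start_s, ride_through_s=FLYWHEEL_RIDE_THROUGH_S):
    """Seconds of flywheel energy left when the diesel takes over.

    A negative result means the flywheel ran down before the diesel
    was carrying the load, so power to the datacenter was interrupted.
    """
    return ride_through_s - diesel_start_s


print(ride_through_margin(DIESEL_START_S))  # 10.0
print(ride_through_margin(20.0))            # -5.0 (hypothetical slow start)
```

The point is that the whole design rests on that margin staying positive. A diesel that fails to start at all is simply the case where it never does.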

The advantage of a CPS over a battery-based system is that the power going to the datacenter is decoupled from the utility power. This eliminates the complex electrical switching required by most battery-based systems, making many CPS installations simpler and sometimes more reliable.

In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, which overloaded the surviving backup system and caused power failures to 3 of 8 customer areas.
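The overload described above can be sketched with a toy model of a shared backup pool. This is not the real 365 Main topology, which the summary here does not spell out. The function, its parameters, and the numbers (a backup pool that can carry 2 areas, with 3 areas needing it) are all hypothetical and chosen only for illustration.

```python
# Toy model of a shared backup pool; every number below is hypothetical.
def areas_down(areas_needing_backup, backup_capacity, shed_load):
    """Return how many customer areas lose power.

    shed_load=False models a tightly coupled pool: once demand exceeds
    capacity, the pool trips and every area relying on it goes dark.
    shed_load=True models a pool that drops only the excess areas.
    """
    if areas_needing_backup <= backup_capacity:
        return 0  # backup absorbs the failures, nobody notices
    if shed_load:
        return areas_needing_backup - backup_capacity
    return areas_needing_backup


print(areas_down(2, backup_capacity=2, shed_load=False))  # 0
print(areas_down(3, backup_capacity=2, shed_load=False))  # 3
print(areas_down(3, backup_capacity=2, shed_load=True))   # 1
```

In the tightly coupled case, one failure beyond what the pool can absorb takes down everything leaning on it, rather than just the one extra area.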

[Figure: 365 Main power systems diagram]

What’s most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have brought down only one area instead of three. This kind of cascade failure is common in complex and tightly coupled systems. In my experience, these sorts of failure modes are often identified and then promptly dismissed as being “nearly impossible”.

Unfortunately, the impossible often becomes reality.

To put it another way… Failure Happens.

Next week we’ll dive into building resilient websites and take a look at a few of the sites that went down. Artur and I are both excited to be writing about this, and welcome your comments, suggestions, and war stories!
