Failure happens

What an exciting day: services for hundreds of thousands of users and millions of readers disappeared from the internet. In a stunning but unsurprising event, repeated power cycling caused by a blown power station disrupted the 365 Main datacenter, which lost all power to two colocation rooms.

I jokingly refer to 365 Main as the “Web 2.0” datacenter; of course, there is nothing Web 2.0 about the datacenter itself. But it does host a remarkable number of such properties, including Craigslist, Technorati, and Red Envelope. Someone could make a lot of VCs cry by taking it out, or so the running joke goes. Ironically enough, this morning 365 Main (together with Red Envelope) put out a press release announcing two years of 100% uptime. The press release has since been removed from their site: http://www.365main.com/press_releases/pr_7_24_07_red_envelope.html now returns a file not found error.

Today’s event shows that the return to the mainframe world is precarious. Maintaining our own systems is harder than paying someone else to do it; trading on this fact, we pay and expect reliability in return. Having entrusted our data, services, and therefore income to these companies, we trust them to keep it all safe and available. When that trust is broken by extended downtime, the arrangement is clearly not sustainable.

But that doesn’t mean that running your own operations is the solution either. A while ago I wrote about disaster recovery with Amazon Web Services, and some of you astutely pointed out that problems can occur if you run your own operations. And indeed they can! And they do! Clearly, disaster recovery plans are just as important if you host your systems yourself. These plans exist on a different continuum, affecting not just operations but also your entire organisation’s response to disasters.

Planning a response to disasters cannot be avoided, particularly given San Francisco’s position between two quite dangerous fault lines. An earthquake is a question of when, not if. Are the startups ready for this? How long should we expect them to be gone? The possible answers look grim given a cursory analysis of today’s events. Several of the world’s largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked. None even had a coherent strategy for communicating the situation to the rest of the world.

Google, Yahoo, Amazon and eBay are companies that have invested money in redundancy. Because of this investment they can build systems that survive outages, a significant comparative advantage for these companies. This advantage presents a tradeoff I am willing to make: even if I don’t trust Google with my data’s privacy, I do trust that they will keep my data safe. (Or at least safer than I or most people could.) But the tradeoff doesn’t pay off if the company cannot do that, and today my trust in a whole swath of companies dropped completely.

I wrote earlier about the discussion around the Open Source Developer Toolkit and the need for frameworks, tools and patterns that improve scalability. But in light of recent events, perhaps reliable and fault-tolerant systems are more important than scalable ones. Events like this make people suddenly realise that MySQL replication is not an adequate disaster recovery strategy. Or that tight coupling between the database and the app might cause a bit of a problem when your database has to move to another city. Or that the memcache you are accessing is suddenly several milliseconds away. There is a small group of people who know this; some may call them jaded and cynical, others may call them experienced. But the vast majority of developers and operations people are completely out of their depth here.
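To put rough numbers on that last point, here is a back-of-the-envelope sketch in Python. The lookup count and round-trip times are hypothetical figures chosen only for illustration, not measurements from any of the affected sites, and the function name is my own. It assumes the app issues its cache lookups one after another, which is the tightly coupled case described above.

```python
def page_latency_ms(lookups: int, round_trip_ms: float) -> float:
    """Time a page spends waiting on the cache when lookups run sequentially."""
    return lookups * round_trip_ms


# Hypothetical figures, for illustration only.
LOOKUPS_PER_PAGE = 100
LOCAL_RTT_MS = 0.5   # assumed: cache in the same datacenter
REMOTE_RTT_MS = 5.0  # assumed: cache "several milliseconds away"

local = page_latency_ms(LOOKUPS_PER_PAGE, LOCAL_RTT_MS)
remote = page_latency_ms(LOOKUPS_PER_PAGE, REMOTE_RTT_MS)

print(f"local cache:  {local:.0f} ms per page")
print(f"remote cache: {remote:.0f} ms per page")
```

Under these assumptions, the same page goes from waiting 50 ms on the cache to waiting 500 ms, without a single line of application code changing. Batching lookups or issuing them in parallel would soften this, but that is exactly the kind of design decision that has to be made before the datacenter goes dark.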

I want to welcome Jesse Robbins to Radar. We are kicking off a series of articles exploring the depths of the dark and forgotten world of operations. Operations has for too long been hiding in the shadows, treated as the poor cousin to engineering and development. It is time to share our horror stories, experiences and ideas in hopes of collectively pushing our profession to a higher level.

