Failure happens

What an exciting day: services for hundreds of thousands of users and millions of readers disappeared from the internet. In a stunning but unsurprising event, repeated power cycling caused by a blown power station disrupted the 365 Main datacenter, which lost all power to two colocation rooms.

I jokingly refer to 365 Main as the “Web 2.0” datacenter; of course, there is nothing Web 2.0 about the datacenter itself. But it does host a remarkable number of such properties, including Craigslist, Technorati, and Red Envelope. Someone could make a lot of VCs cry by taking it out, or so the running joke goes. And ironically enough, just this morning 365 Main (together with Red Envelope) put out a press release announcing two years of 100% uptime; the press release has since been removed from their site, and the link now returns a file not found.

Today’s event shows how precarious the return to the mainframe model is. Maintaining our own systems is harder than paying someone else to do it, and in trading on this fact we expect reliability in exchange for cost. Having entrusted our data, services, and therefore income to these companies, we trust them to keep it all safe and available. When extended downtime breaks that trust, the arrangement stops being sustainable.

But that doesn’t mean running your own operations is the solution either. A while ago I wrote about disaster recovery with Amazon Web Services, and some of you astutely pointed out that problems can occur when you run your own operations too. And indeed they can! And they do! Disaster recovery plans are clearly just as important if you host everything yourself. These plans exist on a different continuum, covering not just operations but your entire organisation’s response to disasters.

Planning a response to disasters cannot be avoided, particularly given San Francisco’s position between two quite dangerous fault lines. An earthquake is a question of when, not if. Are the startups ready for this? How long should we expect them to be gone? A cursory analysis of today’s events suggests grim answers. Several of the world’s largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked. None even had a coherent strategy for communicating the situation to the rest of the world.

Google, Yahoo, Amazon, and eBay are companies that have invested money in redundancy. Because of this investment they can build systems that survive outages, a significant comparative advantage. That advantage presents a tradeoff I am willing to make: even if I don’t trust Google with my data’s privacy, I do trust them to keep my data safe. (Or at least safer than I or most people could.) But the tradeoff only pays off if the company can actually deliver, and today my trust in a whole swath of companies dropped completely.

I wrote earlier about the Open Source Developer Toolkit discussion and the need for frameworks, tools, and patterns that improve scalability. But in light of recent events, perhaps reliable, fault-tolerant systems are more important than scalable ones. Events like this make people suddenly realise that MySQL replication is not an adequate disaster recovery strategy. Or that tight coupling between the database and the app becomes a real problem when your database might be moved to another city. Or that the memcached instance you are accessing is suddenly several milliseconds away. There is a small group of people who already know this; some may call them jaded and cynical, others may call them experienced. But the vast majority of developers and operations people are completely out of their depth here.
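To put the memcached point in numbers, here is a back-of-the-envelope sketch; the round-trip times and the 30-lookups-per-page count are illustrative assumptions, not measurements:

```python
# Hypothetical latency budget: sequential cache lookups per page render.
# All numbers below are illustrative assumptions.

def page_cache_wait_ms(lookups_per_page, round_trip_ms):
    """Time a page spends waiting on cache round trips, done sequentially."""
    return lookups_per_page * round_trip_ms

# memcached in the same rack: roughly 0.2 ms per round trip.
same_rack = page_cache_wait_ms(30, 0.2)    # a few ms, invisible to users

# The "same" memcached after failover to another city: tens of ms per trip.
cross_city = page_cache_wait_ms(30, 40.0)  # over a second of waiting
```

The arithmetic is trivial, but it is exactly what gets forgotten: a cache designed to be sub-millisecond away stops being a cache when every lookup crosses a WAN link.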

I want to welcome Jesse Robbins to Radar. We are kicking off a series of articles exploring the depths of the dark and forgotten world of operations. Operations has too long been hiding in the shadows, treated as the poor cousin to engineering and development. It is time to share our horror stories, experiences and ideas in hopes of collectively pushing our profession to a higher level.

  • 365 Main’s initial incident report found that several generators failed to start when grid power failed. This event could lead many Web 2.0 companies to consider redundant backup centers, as most banks and financial institutions do. Financial regulators have made it clear that they prefer to see some distance between the primary and backup centers to ensure that a single “regional event” won’t take out both facilities. The cost/benefit analysis on a backup data center is quite different for a major bank than for a Web 2.0 startup, however. As for San Francisco’s vulnerability to an earthquake, that’s been a selling point for data centers in Sacramento for some time now.

  • Jason

    I work for a small company and when our server went down we had no backups let alone a plan. Did some digging on the Internet and found this article on disaster planning.
    Business continuity planning.
    I had no idea data recovery was even an option. Hopefully I won’t need their services but in the future I will give CBL a call before I try anything. BTW we got back up and running a couple of days later. Lost some emails and 2 days of productivity. J.

  • In all the various blog posts about this datacenter crash, no one seems outraged that their back-up systems are obviously inadequate. Do they ever test? At the last company I worked for with a datacenter, we had a UPS the size of a van and a diesel generator the size of a garage, and we tested them at least monthly. Given the size of the sites they are hosting, I think this is a major issue; if I were one of their customers I’d be reading my SLA and planning my getaway.
    And I’d be looking hard at a rural facility near a dependable (hydro) power source. Wait, that sounds like a Google datacenter…

  • I doubt yesterday’s events at 365 Main would surprise anyone experienced in operations. I recall dealing with similar situations as far back as the ’70s. And you will find at least a dozen current books on IT disaster recovery in print, so this stuff isn’t rocket science. Perhaps operations has been “hiding in the shadows” in your world, but rest assured that operations doesn’t always play second fiddle in enterprises where availability is paramount.

    I think this post simply illustrates that every generation of technologists must relearn “the hard way” the lessons learned by the previous generations.

  • Martin:

    There is no real reason I would be outraged particularly at 365 Main. The outrage is with companies who believe being in one datacenter is sufficient protection against failure.

  • While it sometimes takes an outage of this size to strike fear into the hearts of most ops people, the fact remains: a SPOF (single point of failure) is a SPOF, no matter how big that “point” is.

    For those startups out there whose budgets won’t (yet) allow full redundancy of their architecture, an easy first step is spending the small amount of effort and money needed to serve a quick status/downtime page from multiple locations when you need it.

    The only thing worse than being bitten by operational failures is not being transparent about them to your users.

    I feel for the guys who had to deal with all of this yesterday.
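The “status page from multiple locations” advice in the comment above can be as cheap as a static page served from a box in a different facility. A minimal sketch using only Python’s standard library; the page content is hypothetical, and pointing a hostname at the box is left to DNS:

```python
# Tiny status-page server: run this on a machine that does NOT share power,
# network, or a building with your primary datacenter. Content is a placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_HTML = (b"<html><body><h1>Service status</h1>"
               b"<p>We are investigating an outage; updates hourly.</p>"
               b"</body></html>")

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path returns the same static page; there is nothing dynamic
        # to break when everything else is down.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(STATUS_HTML)))
        self.end_headers()
        self.wfile.write(STATUS_HTML)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def make_status_server(port=0):
    """Bind the status server; port 0 lets the OS pick a free port."""
    return HTTPServer(("", port), StatusHandler)

# To run for real: make_status_server(80).serve_forever()
```

The point is not this particular server but the placement: the page only has value if it lives somewhere a single regional event cannot reach.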

  • I wanted to share a few thoughts on this issue after reading about it yesterday. Currently, my company hosts its web servers in a Bay Area data center (not 365 Main) and this is a constant issue in the back of my head.

    When I worked at a company that hosted 300+ servers in Santa Clara, we experienced outages from a variety of sources:

    1. Power – we were on the same grid as SJC airport, yet our power delivery was such that we experienced three outages in three years.
    A. One major outage (before I joined operations) happened when a corroded part in the auto-failover switching equipment finally broke and fell across two live connections; the resulting short blew up the equipment and knocked out power to the building.

    2. HVAC – our HVAC systems suffered an outage when the chilled water return pipe separated in a six-foot geyser outside the building. Facilities called it a ‘seismically-created event’. The upshot was that we were renting large fans and portable HVAC units to vent all the heat that was no longer being handled by our onsite HVAC system. Managing this system was a long-running problem that pitted Operations and Engineering (two business units that shared the data center space) against Facilities and the outsourced HVAC contractor.

    3. Telco – lines were damaged by, and vulnerable to, construction crews remodeling the campus. Facilities had brought in a crew of subcontracted subcontractors (figure that one out…) and we had a pool going on which day they would finally dig across our OC3 lines and take the building’s communications out. As it stands, they only managed to damage the buried phone line serving the ATM in the cafe – comparatively minor, but a real issue we had our fingers crossed over daily.

    Operations as a managed service makes a lot of sense – Web 2.0 companies aren’t in the same nuts-and-bolts business as operations; let the experts do their job and hand you a bill. If you happen to be the one working directly with your hosting company, do everything you can to maintain a healthy relationship with them. Ask specific questions and expect answers; don’t be satisfied by what their sales team or their handouts say. Ask to speak with the on-floor techs and see what they say.

  • “We are kicking off a series of articles exploring the depths of the dark and forgotten world of operations.”