Mon, Nov 12, 2007

Jesse Robbins

Failure Happens: An SLA is just a contract & Data Centers are single points of failure too

Rackspace just had a Data Center failure as Scott Beale, TechCrunch, and Valleywag are reporting. Rackspace has been one of the most reliable infrastructure providers for many years, and it has done much to live up to the slogan of providing "fanatical support". Unfortunately, many people misinterpret its "zero downtime network" marketing as a promise that it will not fail.

Rackspace does not promise that its system will not fail; instead, it establishes a Service Level Agreement (SLA) which specifies how customers will be compensated when failure happens. The Rackspace SLA is actually one of the clearest in the industry:

Rackspace's SLA is a contract between you, the customer, and Rackspace. It defines the terms of our responsibility and the money back guarantees if our responsibilities are not met. We want our customers to feel at ease with their decision to move their site to Rackspace, and knowing that Rackspace takes your site's uptime as seriously as you do is imperative. [...]

Rackspace guarantees that the critical infrastructure systems will be available 100% of the time in a given month, excluding scheduled maintenance. Critical infrastructure includes functioning of all power and HVAC infrastructure including UPSs, PDUs and cabling, but does not include the power supplies on customers' servers. Infrastructure downtime exists when a particular server is shut down due to power or heat problems and is measured from the time the trouble ticket is opened to the time the problem is resolved and the server is powered back on.

Rackspace Guarantee: Upon experiencing downtime, Rackspace will credit the customer 5% of the monthly fee for each 30 minutes of downtime (up to 100% of customer's monthly fee for the affected server).
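To make the arithmetic concrete, here is a minimal sketch of that credit formula in Python (the function name is mine, and I'm assuming only full 30-minute blocks count, which the SLA text does not actually specify):

```python
def sla_credit(monthly_fee, downtime_minutes):
    """5% of the monthly fee per 30 minutes of downtime,
    capped at 100% of the fee for the affected server."""
    blocks = downtime_minutes // 30            # assumption: only full 30-minute blocks count
    credit = 0.05 * monthly_fee * blocks
    return min(credit, monthly_fee)            # never more than one month's fee

# e.g. a $400/month server down for 3 hours: 6 blocks -> $120 credit
print(sla_credit(400, 180))
```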

Please remember that Data Centers are single points of failure too. (See Artur Bergman's post and my followup after the 365 Main outage.)


Incidentally, the Recovery Oriented Computing project is an exceptional resource for those interested in building resilient systems:

The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services. In a significant divergence from traditional fault-tolerance approaches, ROC emphasizes recovery from failures rather than failure-avoidance. This philosophy is motivated by the observation that even the most robust systems still occasionally encounter failures due to human operator error, transient or permanent hardware failure, and software anomalies resulting from "Heisenbugs" or software aging.

The ROC approach takes the following three assumptions as its basic tenets:
* failure rates of both software and hardware are non-negligible and increasing
* systems cannot be completely modeled for reliability analysis, and thus their failure modes cannot be predicted in advance
* human error by system operators and during system maintenance is a major source of system failures
These assumptions, while running counter to most existing work in dependable and fault-tolerant systems, are all strongly supported by field evidence from modern production Internet service environments.
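In that spirit, here is a minimal sketch (my own illustration, not code from the ROC project) of a supervisor loop that assumes the worker will crash and concentrates on restarting it quickly rather than on preventing the crash:

```python
import subprocess
import time

def supervise(cmd, max_restarts=5, backoff_seconds=2.0):
    """Keep a worker process running; treat crashes as expected events
    and recover by restarting, rather than assuming they won't happen."""
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        code = proc.wait()
        if code == 0:
            return                              # clean exit, nothing to recover
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("giving up after %d restarts" % max_restarts)
        print("worker died (exit %d), restart #%d" % (code, restarts))
        time.sleep(backoff_seconds * restarts)  # back off so a crash loop can't thrash the box

# usage sketch: supervise(["python", "worker.py"])
```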

Update: "Rackspace outage was third in two days" (Valleywag), "Truck Crash Knocks Rackspace Offline" (Data Center Knowledge)


Comments: 9

  Graham Weston, Chairman of Rackspace [11.12.07 11:59 PM]

You make some good points about SLAs. Your comment that our SLA is the "clearest in the industry" made me grin from ear to ear (after this tough day, I needed a lift). We worked hard to make the Rackspace SLA easy to understand and simple. We wanted it in clear English because it is our written promise... our pledge. We pledge to pursue 100% uptime. We do not consider a minute of downtime acceptable.

Today, we broke our promise to a lot of customers in our Dallas datacenter. We will make it right with or without this SLA.

  Marc Hedlund [11.13.07 06:35 AM]

In the interest of full disclosure (since I write for Radar), my company, Wesabe, is hosted at Rackspace and was taken offline by this event.

I was pretty happy with Rackspace's communication about the event. I was unhappy to have to discover the problem myself, and not to have heard from Rackspace directly when the problem occurred. Once I called, though, I got good and honest information from the NOC contact and good follow-up calls and emails as the night progressed.

Jesse is right that treating any single datacenter as bullet- (or truck-) proof is a mistake.

  Raz [11.13.07 06:55 AM]

Graham Weston, why do you consider yourself a party to the agreement, rather than only the one providing it?

  Chris Scott [11.13.07 08:21 AM]

I was also down for 3 hours last night.

They shut down our servers due to the heat at the datacenter. From what we understand:
In the second incident at approximately 6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center's chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim. This repeated cycling of the chillers resulted in increasing temperatures within the data center. As a precautionary measure we decided to take some customers' servers offline. These servers are now back up, as are the chillers.

So it seems the redundant systems worked, power and all, but the chillers failed when they had to be cycled multiple times because of the accident victim.

Although all of our servers (and our image) suffered, I can't say enough good things about Rackspace and what they've done for us. I mean, with all my experiences with datacenters (especially The Planet), they handled everything as well as I could ask for. They've gone above and beyond with any support request my team and I have had, and they are simply... Fanatical, as much as I can expect them to be.

  Michael Marano [11.13.07 10:31 AM]

I've had major power failures at other major hosting facilities before. Never have I had a problem of this magnitude handled with as much grace and transparency as Rackspace has done. It sucks to have a failure; it's awesome to have a great response.

  Dennis Linnell [11.13.07 12:21 PM]

Rackspace certainly offers a clear SLA, but so what? Any SLA is nothing more than a marketing gimmick. If Rackspace didn't offer that SLA, would you buy from them? I certainly would, because I have confidence they'll do a great job and treat me the way I'd like to be treated.

Here I assert several more reasons why the industry standard SLA is artful and bogus.

But let's ask those of you who suffered downtime in the latest unfortunate incident: Did Rackspace's generous SLA fully compensate you for all costs resulting from the outage?

  Michael Sparks [11.13.07 12:35 PM]

I think the "recovery oriented computing" approach reflects something anyone who's run a network service knows: all software breaks, and usually at the worst possible time. If you design your system with that in mind, your life is almost always a lot better. Especially if you remember recovery can take serious time.

To call it a novel approach strikes me as rather bizarre! Maybe it's new to academia - people on the ground have been doing this for years. (I learnt it (from others) when I started running network services 10 years ago...)

To be fair, the about page does say: "These assumptions ... are all strongly supported by field evidence from modern production Internet service environments," ... "Practical experience and anecdotal evidence speak to the fact that many real-world failures are created or compounded by non-functional repair or warning systems" (which is true; broken watchers are really annoying, and the reliability of watchmen systems is something most good network admins look at), and "In some cases, proactively restarting components before they fail can improve overall availability; most clustered Internet services already do this."

All that said, the more projects there are like this the better. Still, there is one word I'm surprised to see missing - triage - which suggests a body of common thinking styles may be missing. Lots of real-world network system troubleshooting ends up using terminology borrowed from the medical profession's casualty/emergency room scenarios. You triage a situation to decide what needs work first by looking at the systems, you produce patches to bandage a system, and so on. Look at quiet systems first (since they may need the most help). Once you've got the system (as a whole) past critical, you start to diagnose the systemic problem.

This also explains why designing for clustering is a really good idea, and why having a cluster formed of heterogeneous but protocol-compatible software can be a good idea. If you do, you can use tools like Linux Virtual Server (LVS), or buy a layer 4 switch, to load balance. Aside from anything else, it allows you to pull a system out of operation without affecting the service. If you use cross-site load balancing (which you can do with LVS), your service can happily survive data centre death. (When I was at the JWCS many moons ago we'd switched over to using LVS for balancing and had a power cable severed at one site. Due to using LVS at multiple locations and cross-site balancing, the service kept on running.)
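A minimal sketch of that cross-site idea in Python (not LVS itself; the hostnames and health-check URL are made up for illustration): check each site's health endpoint and send traffic to the first one that answers, which is the cross-site analogue of pulling a dead real server out of rotation.

```python
import urllib.request

# Hypothetical endpoints; in practice these are the VIPs your balancer
# (LVS, a layer 4 switch, etc.) exposes at each data centre.
SITES = ["http://dc1.example.com/health", "http://dc2.example.com/health"]

def pick_live_site(timeout=2):
    """Return the first site whose health check answers with 200."""
    for url in SITES:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue                       # this site (or its data centre) is down
    raise RuntimeError("no site is answering health checks")
```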

  Ian Rae [11.14.07 09:13 AM]

Well said: "Please remember that Data Centers are single points of failure too"

The biggest challenge with multi-site architectures is consistent replication/synchronization of the data tier, but today's software is quite flexible and sophisticated when it comes to multi-site replication, and the costs of high performance intersite links have plummeted. The complexity is still high, so folks have to go to the trouble of designing the application for high availability if they really care about resiliency in the face of geographic or political failures.

For real scalability, some would argue we need to change how our applications store, process, and manage data. See the discussion of Amazon's Dynamo at http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html. But scalability warrants a separate, though related, discussion.

  Damon Edwards [11.15.07 04:15 PM]

The dirty little secret of most hosting and managed services companies is that their tooling isn't much better than what you would cobble together on your own (and in many cases leading web operations like Google and E*TRADE are light-years beyond your average hosting or managed services provider).

Outages caused by human mistakes are inevitable. Change is going to happen somewhere... and no matter how redundant you make your system, someone will discover a new way to break it.

The important thing is to be able to quickly recover to full operating capacity. This can only be done if you have deployed an integrated toolset that can automatically and reliably redeploy your entire infrastructure to a "last known good" state at the literal push of a button.

Very few organizations can honestly make that claim today. Usually they've got things 80% automated and rely on good old-fashioned brute force for the rest.
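As a minimal sketch of that "last known good" idea (the manifest format and the deploy-tool command are invented for illustration; real tools like the ones below do this properly), the key property is that every host is driven back to the same recorded state, every time:

```python
import json
import subprocess

def redeploy_last_known_good(manifest_path="last_known_good.json"):
    """Re-apply a recorded good state wholesale instead of hand-fixing drift.
    Example manifest: {"web01": "app-1.4.2", "db01": "schema-37"}"""
    with open(manifest_path) as f:
        manifest = json.load(f)
    for host, version in manifest.items():
        # 'deploy-tool' is a stand-in for whatever actually pushes the bits;
        # the point is that the rollback is automated and repeatable.
        subprocess.check_call(["deploy-tool", "--host", host, "--version", version])
```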

Don't think it's possible to get to 100% automation? These open source projects are making it a reality:

Puppet:
http://puppet.reductivelabs.com

ControlTier (shameless plug)
http://open.controltier.com
