Failure Happens: An SLA is just a contract & Data Centers are single points of failure too

Rackspace just had a Data Center failure, as Scott Beale, TechCrunch, and Valleywag are reporting. Rackspace has been one of the most reliable infrastructure providers for many years, and it has done much to live up to its slogan of providing “fanatical support”. Unfortunately, many people misinterpret its “zero downtime network” marketing as a promise that it will never fail.

Rackspace does not promise that its systems will not fail. Instead, it establishes a Service Level Agreement (SLA) that specifies how customers will be compensated when failure happens. The Rackspace SLA is actually one of the clearest in the industry:

Rackspace’s SLA is a contract between you, the customer, and Rackspace. It defines the terms of our responsibility and the money back guarantees if our responsibilities are not met. We want our customers to feel at ease with their decision to move their site to Rackspace, and knowing that Rackspace takes your site’s uptime as seriously as you do is imperative. […]

Rackspace guarantees that the critical infrastructure systems will be available 100% of the time in a given month, excluding scheduled maintenance. Critical infrastructure includes functioning of all power and HVAC infrastructure including UPSs, PDUs and cabling, but does not include the power supplies on customers’ servers. Infrastructure downtime exists when a particular server is shut down due to power or heat problems and is measured from the time the trouble ticket is opened to the time the problem is resolved and the server is powered back on.

Rackspace Guarantee: Upon experiencing downtime, Rackspace will credit the customer 5% of the monthly fee for each 30 minutes of downtime (up to 100% of customer’s monthly fee for the affected server).
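
To make the arithmetic of that guarantee concrete, here is a minimal Python sketch of the credit calculation. The function names and the rounding rule are assumptions of this sketch, not Rackspace's: the quoted text does not say how a partial 30-minute block is treated, so the code counts only completed blocks.

```python
def sla_credit_percent(downtime_minutes: float) -> int:
    """Credit as a percentage of the monthly fee for the affected server.

    Assumption: only completed 30-minute blocks earn credit. The quoted
    SLA text does not say how partial blocks are handled.
    """
    if downtime_minutes < 0:
        raise ValueError("downtime cannot be negative")
    full_blocks = int(downtime_minutes // 30)
    # 5% per block, capped at 100% of the monthly fee.
    return min(5 * full_blocks, 100)


def sla_credit(monthly_fee: float, downtime_minutes: float) -> float:
    """Credit amount in the same currency unit as monthly_fee."""
    return monthly_fee * sla_credit_percent(downtime_minutes) / 100


if __name__ == "__main__":
    # Hypothetical server billed at 500 per month, down for three hours.
    print(sla_credit_percent(180))
    print(sla_credit(500.0, 180))
```

Under these assumptions, a three-hour outage earns a 30% credit, and the cap is reached after ten hours of downtime. Either way, the credit is bounded by the monthly fee for the affected server, not by what the outage cost your business.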

Please remember that Data Centers are single points of failure too. (See Artur Bergman’s post and my follow-up after the 365 Main outage.)

Incidentally, the Recovery Oriented Computing project is an exceptional resource for those interested in building resilient systems:

The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services. In a significant divergence from traditional fault-tolerance approaches, ROC emphasizes recovery from failures rather than failure-avoidance. This philosophy is motivated by the observation that even the most robust systems still occasionally encounter failures due to human operator error, transient or permanent hardware failure, and software anomalies resulting from “Heisenbugs” or software aging.

The ROC approach takes the following three assumptions as its basic tenets:

* failure rates of both software and hardware are non-negligible and increasing

* systems cannot be completely modeled for reliability analysis, and thus their failure modes cannot be predicted in advance

* human error by system operators and during system maintenance is a major source of system failures

These assumptions, while running counter to most existing work in dependable and fault-tolerant systems, are all strongly supported by field evidence from modern production Internet service environments.

Update: “Rackspace outage was third in two days” (Valleywag), “Truck Crash Knocks Rackspace Offline” (Data Center Knowledge)
