Amazon EC2 and S3 disaster planning

In my post Amazon Web Services and the lack of a SLA, I asked the following question.

So, if your company is based on AWS: What does your disaster recovery plan look like? How do you react if Amazon goes down or if Amazon decides to shut down AWS?

I asked what your tradeoffs and disaster recovery plans were: what would you do if Amazon Web Services shut down abruptly, or even with 30 days' notice? The question that got answered instead was why you use AWS and how the alternatives have problems of their own.

Perhaps I used the wrong acronym when I said SLA. People have a conception that an SLA is something heavy-handed and rarely useful. I agree that most SLAs I have ever seen are completely useless, but they do give you an idea of how seriously the service provider takes your custom. The quote from the T&C is effectively an SLA, one that promises absolutely nothing.

I have been doing complex operations work for the last 7 years, from financial exchange infrastructure to two of the largest websites around. Trust me, I know the pain: waiting 70 hours for an fsck to finish, or waking up in the middle of the night because someone committed a faulty configuration to cfengine.

I am aware of the risks of my infrastructure setup. Using the MTBF of every component, I can calculate how likely a catastrophic failure is, or non-catastrophic downtime. I can analyze my situation thoroughly and then make an educated business decision on what level of investment I need. Do I need 2 or 3 copies of my data? Do I need multiple datacenters? What happens if a datacenter fails? (As an example, recovering a petabyte of data from a failed datacenter over a 10 Gb/sec link takes more than nine days even at full line rate; at a realistic sustained throughput of a few hundred megabits per second, it takes the better part of a year.) How long can I tolerate running with less redundancy?

On the SLA side, even if I don't have a hard-core SLA, I do have a contract with my datacenter that forces them to give me notice if they plan to evict me. The investment needed is large, and the operational cost is a nightmare. Don't get me wrong, I would love to let someone else do that for me.
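The recovery-time estimate is easy to sanity-check. A hypothetical back-of-the-envelope sketch (the assumed sustained throughput is what dominates the result; the 250 Mb/s figure is an illustrative guess at real-world effective throughput, not a measured number):

```python
# Back-of-the-envelope recovery-time math for restoring a failed datacenter.
# Assumption: the link sustains the stated rate continuously; real effective
# throughput over long transfers is usually far below nominal line rate.

def transfer_days(data_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move `data_bytes` over a link of the given speed."""
    seconds = (data_bytes * 8) / link_bits_per_sec
    return seconds / 86_400  # seconds per day

petabyte = 1e15  # bytes

# A dedicated 10 Gb/s link at full line rate: roughly nine days.
print(f"{transfer_days(petabyte, 10e9):.1f} days at 10 Gb/s")   # 9.3 days

# A realistic sustained 250 Mb/s: the better part of a year.
print(f"{transfer_days(petabyte, 250e6):.1f} days at 250 Mb/s") # 370.4 days
```

The point stands either way: bulk recovery time is bandwidth-bound, so the redundancy decision has to be made from numbers like these, not from optimism.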

No one said they analyzed the risks that way. Looking Jeff in the eyes is not an analysis. I am sure Jeff has trustworthy eyes, but he could die tomorrow, and probably not of an eye disease. Incidentally, the risk of death in key positions is something most organisations fail to put in their disaster plans. Wall Street is grumbling that Amazon's margins are down and has questioned the investments in non-core business. I am convinced that Amazon, right now, is fully committed to providing AWS. But there is no certainty that will be true forever. From an operations point of view, I am not a believer; I don't use faith as my guiding light, I trust hard, real numbers.

If I were doing a startup, AWS would be perfect for bootstrapping and trying my idea out. If they go down, I haven't lost much, since I am in the early stages. If I were a larger company, I would use them as a backup and perhaps as an emergency scaling service. But if they go down, my customers won't blame Amazon; they will blame me. Once you have paying customers, the cost of going down rises rapidly.

Don, you say that Amazon keeps secondary and tertiary copies for you. How do you know this? Has Amazon divulged this to you? I couldn't find that data anywhere. I think it would be great if they let me choose my level of redundancy and charged me accordingly. And in the hypothetical situation where an S3 item became amazingly popular, and given that Amazon doesn't have infinite bandwidth, do you seriously think they would let that impair the operations of Amazon.com? Since you are a private company with no plans to go public, you aren't responsible to shareholders who will sue you if you ruin the company. I understand that makes the decision much simpler.

I think there are three things Amazon could do to improve the situation dramatically:

  • Change the T&C to at least promise to give paying customers a notice of a certain amount of days if they choose to shut the service down.
  • Publish their current uptime and availability to their customers.
  • Show how many copies of a file exist, and how quickly a newly uploaded file becomes redundant.

Now I hope someone will compete with Amazon, and I think Microsoft and Sun are two likely candidates. Google’s main competitive advantage right now seems to be their ability to scale systems massively very quickly. Giving that away as a service doesn’t make much sense.

So, I asked about disaster recovery, and I have yet to hear anything about it. The question is still open.


  • Thomas Lord

    If Amazon S3 makes an attractive target for enough start-up efforts, then some of those efforts will win big. As they go through the process of winning big either they will migrate away from Amazon, or they will help shape Amazon’s product line in response to real-world needs.

    It is a very smart play by Amazon to explore this obvious new market empirically, precisely because it is both within their core competence and secondary to their current business model. Mr. B. is hecka smart and he'll win to the degree the people running S3 are flexible, patient, and responsive to changing demand.

    -t

  • http://kitchensoap.com john allspaw

    Thanks for posting these, Artur. I have yet to see anyone ask these plainly obvious questions of EC2 and S3, and I remain dubious on many levels about how the virtualization/cloud concepts will apply to growing/large sites.

  • http://www.ideasystm.ca Chris Fizik

    A client of mine, after a bad RAID array crash that we couldn't recover from, turned to Amazon as a scalable backup/emergency solution. So far it seems to be a very viable solution — but very interesting point about what happens if S3 goes down….

  • http://www.kinlane.com Kin Lane

    Every solution should have a backup or plan B. We put a lot of trust in Amazon Web Services and think that because they are so big, they couldn't possibly go away.

    It is safe to say they won’t go away without notice, but there should always be a plan B.

  • http://ryan-technorabble.blogspot.com Ryan Baker

    I’m not sure what comparable agreements you’re looking at, but I went through a process where there was a great deal of undue concern about Amazon agreements.

    Many SLAs don’t guarantee any level of service; they guarantee the provider will charge you less, or give free service, if they screw up. Unless they force the provider to pay you, you’re not guaranteed any service at all.

    Amazon’s lack of SLA is mostly due to the lack of any quid-pro-quo agreement from the user. You pay for what you use, not a penny more, and not a minute longer than you want. There’s no cancellation fee.

    Now I can imagine you might be negotiating contracts and SLAs that say a lot more than the average SLA, in which case you are guaranteed something, up to the provider’s ability to pay and the ability of the legal system to enforce the agreement in a cost-effective manner.

  • http://www.simson.net/ Simson Garfinkel

    I think that the word that you are looking for is not SLA or availability, but “durability.” Amazon does promise that the data will be available (although they don’t back this promise up with anything), but they do not make any enforceable long-term commitments.

  • http://www.3tera.com/hotcluster.html Bert Armijo

    Excellent post.

    Of course, there is only one viable answer to your question; there must be competitive services that can be leveraged for redundancy. This is how web companies deal with existing service providers today. No doubt, anybody who’s ever been responsible for selecting a colo provider has checked to ensure they pull power from multiple grids and have generator backup as well. Likewise for connectivity. Even so, larger firms almost always still have a presence in multiple colo facilities. I see no reason to believe this should be different for utility computing. In development and beta, a single provider may suffice. In production, however, investors are unlikely to accept that “the vendor’s got us covered.”

    Early last year, as we took our AppLogic system into beta, 3tera faced this challenge. Instead of creating our own service, we chose to offer service in partnership with existing hosting providers, by enabling them to offer utility computing services. This model has proven successful. More than once users with high uptime requirements have availed themselves of the ability to work with multiple providers.

  • http://chxo.com/ Chris Snyder

    From an engineering standpoint, there’s not much magic involved in EC2. If they stop the service, or you don’t like how they run it, then you build your own Xen servers and migrate your AMIs to those.

    Will you suffer for a while without the nifty management interface? Sure. Could you build your own using Ruby or PHP in a few days? Yep.

    S3 would be harder to replace, but it looks enough like WebDAV that you could convert your storage calls overnight. Getting all your data out might take longer than 30 days, though; I suggest being ready to parallelize that.

    The genius of EC2 is that, beyond the Xen kernel, it’s your code, not Amazon’s. That makes it pretty easy to migrate to a similar provider if AWS goes south. No standard Application Service Provider (Google, cough!) can promise that, and yet businesses are expected to build on those platforms. AWS is a much safer bet.

  • http://www.atnan.com Nathan de Vries

    @Chris Snyder: Thankfully, many of us build software such that we could switch to our own self-hosted services relatively easily. Unfortunately, that’s not the case for many others.

    But you’re right, building your own S3 / EC2 / SQS solution wouldn’t be too difficult using Sun X4500s + WebDAV / Xen / ActiveMQ.

    In the absence of SLAs for AWS, I guess our only option is to ensure that everything we use has an alternative, and everything we build is capable of using it.

  • http://twopieceset.blogspot.com Nick Gerner

    This is a great post. Werner Vogels (CTO of Amazon) just presented at the Seattle Conference on Scalability (hosted by Google) to speak about the systems they’re using at Amazon. Presumably some of those systems back EC2 and S3, or will very soon.

    He got a lot of questions about businesses, serious businesses that NEED SLAs as you describe, using S3 and EC2. He said (I’m paraphrasing), “Are we working on this? Yes. Do we have a timeline to address those concerns? Yes. Will I tell you what that timeline is? No.”

    I specifically asked him if he wants AWS to be positioned as (one of) the tech platform providers for web-scale businesses. He said absolutely they do. And he knows that the rest of us want guarantees and data to back that platform up. So keep an eye out…

  • http://www.enomalylabs.com Reuven Cohen

    We just came out with an S3 network block device.
    ElasticDrive provides the unique capability of writing backups to a local disk and a remote Amazon S3 storage system simultaneously. The very same data can be available on-line for quick restores from disk and off-site. Check out http://www.elasticdrive.com

  • http://www.maxHeap.com Sumit

    A very nice post. A disaster recovery plan has to be in place, well thought out and documented. I think it’s not just the case with AWS; it’s the case with every damn thing on earth. Be it AWS or the WTC falling down, risk management has to be done.

  • Mike

    You might want to do a tad more research on your suggestions: “Change the T&C to at least promise to give paying customers a notice of a certain amount of days if they choose to shut the service down.”

    Amazon has already done this, and now provides 60 days notice of shutdown:

    http://www.amazon.com/gp/browse.html?node=3440661#3


    3.3.2. Paid Services (other than Amazon FPS). We may suspend your right and license to use any or all Paid Services (and any associated Amazon Properties) other than Amazon FPS, or terminate this Agreement in its entirety (and, accordingly, cease providing all Services to you), for any reason or for no reason, at our discretion at any time by providing you sixty (60) days’ advance notice in accordance with the notice provisions set forth in Section 15 below.

  • http://cloud.8kmiles.com/ Harish Ganesan

    We feel AWS has evolved and improved a lot in the last 3 years in terms of underlying technology and offerings. Not only startups, but even bigger companies have started using AWS in their overall disaster recovery strategy since last year. On the other side, we have to understand that not all applications are suitable to leverage AWS in their disaster recovery strategy. Their SLAs might still be a concern for some types of applications and companies.

    We have published a blog article exploring the various disaster recovery architectures that we can design for our infrastructure using AWS.

    http://cloudblog.8kmiles.com/2011/03/08/architecture-blueprints-disaster-recovery-using-aws/

  • http://www.disasterrecovery.com Arnold Villeneuve

    Great article. And yes, AWS is a great component to add into the mix to get your business and IT department back up and running. That’s why KingsBridge Systems has implemented our Phoenix BIA/DRP/BCP planning solution on the AWS cloud. What better place to have your DRP/BCP plans than outside of your own infrastructure, in a secure and highly redundant location like AWS?

    Check out http://www.disasterrecovery.com for more information about Phoenix for SharePoint AWS Edition.

    Arnold