Amazon EC2 and S3 disaster planning

In my post Amazon Web Services and the lack of a SLA, I asked the following question.

So, if your company is based on AWS: What does your disaster recovery plan look like? How do you react if Amazon goes down or if Amazon decides to shut down AWS?

I asked what your tradeoffs and disaster recovery plans were: what would you do if Amazon Web Services decided to shut down abruptly, or even with 30 days’ notice? The question that got answered instead was why you use AWS and how the alternatives also have problems.

Perhaps I used the wrong acronym when I used SLA. People have the idea that an SLA is something heavy-handed that is rarely useful. Now, I agree that most SLAs I have ever seen are completely useless, but they do give you an idea of how seriously the service provider takes your custom. The quote from the T&C is effectively an SLA, one that promises absolutely nothing.

I have been doing complex operations work for the last 7 years, from financial exchange infrastructure to two of the largest websites around. Trust me, I know the pain. Waiting for an fsck that will take 70 hours to finish. Waking up in the middle of the night because someone committed a faulty configuration to cfengine.

I am aware of the risks of my infrastructure setup. Using the MTBF of every component, I can calculate how likely a catastrophic failure is, or how likely non-catastrophic downtime is. I can analyze my situation thoroughly and then make an educated business decision on what level of investment I need. Do I need 2 or 3 copies of my data? Do I need multiple datacenters? What happens if a datacenter fails? (As an example, recovering from a failed datacenter that was hosting a petabyte of data takes more than nine days over a 10 Gb/sec link even at full line rate, and far longer at the throughput you can realistically sustain.) How long can I tolerate running with less redundancy? On the SLA side, even if I don’t have a hard-core SLA, I do have a contract with my datacenter that forces them to give me notice if they plan to evict me. The investment needed is large, and the operational cost is a nightmare. Don’t get me wrong, I would love to let someone else do that for me.
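
To make that kind of back-of-the-envelope analysis concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a description of any real setup: the function names are made up, failures are modeled as independent with a constant rate (an exponential model), and the `efficiency` parameter is a stand-in for whatever fraction of line rate you can actually sustain.

```python
import math

SECONDS_PER_DAY = 86_400
HOURS_PER_YEAR = 8_760


def transfer_days(data_bytes, link_bits_per_sec, efficiency=1.0):
    """Days needed to copy data_bytes over a link.

    efficiency is the assumed fraction of line rate actually sustained
    (1.0 means the ideal, full line rate).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / SECONDS_PER_DAY


def annual_failure_probability(mtbf_hours):
    """Chance of at least one failure in a year.

    Assumes a constant failure rate of 1/MTBF (exponential model).
    """
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def all_copies_lost_probability(mtbf_hours, copies, window_hours):
    """Chance that every copy fails within the same time window.

    Assumes the copies fail independently, which is optimistic when
    they share a datacenter, a power feed, or a software bug.
    """
    p_single = 1 - math.exp(-window_hours / mtbf_hours)
    return p_single ** copies


if __name__ == "__main__":
    petabyte = 10**15          # decimal petabyte, in bytes
    ten_gbit = 10 * 10**9      # 10 Gb/sec, in bits per second
    print(f"full line rate: {transfer_days(petabyte, ten_gbit):.1f} days")
    print(f"10% sustained:  {transfer_days(petabyte, ten_gbit, 0.1):.1f} days")
```

The useful part is less the numbers than the shape of the questions: how long the recovery window is, and how likely it is that the remaining copies fail inside that window. Plug the recovery time from `transfer_days` into `all_copies_lost_probability` as `window_hours` to compare 2 copies against 3.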

No one said they analyzed the risks that way. Looking Jeff in the eyes is not an analysis. I am sure Jeff has trustworthy eyes, but he could die tomorrow, and probably not of an eye disease. Incidentally, the risk of death in key positions is something most organisations fail to put in their disaster plans. Wall Street is grumbling that Amazon’s margins are down, and has questioned the investments into non-core business. I am convinced Amazon, right now, has a full commitment to providing AWS. But there is no certainty that will be true forever. From an operations point of view, I am not a believer: I don’t use faith as my guiding light, I trust hard, real numbers.

If I was doing a startup, AWS would be perfect to bootstrap and try my idea out. If they go down I haven’t lost much since I am in early stages. If I was a larger company I would use them as a backup and perhaps as an emergency scaling service. But if they go down, my customers won’t blame Amazon, they will blame me. Once you have paying customers, the cost of going down rises rapidly.

Don, you say that Amazon keeps secondary and tertiary copies for you. How do you know this? Has Amazon divulged this to you? I couldn’t find that data anywhere. I think it would be great if they let me choose my level of redundancy and charged me accordingly. And given the hypothetical situation where an S3 item became amazingly popular, and given that Amazon doesn’t have infinite bandwidth, do you seriously think they would let that impair the operations of Amazon.com? Since you are a private company with no plans to go public, you aren’t responsible to shareholders who will sue you if you ruin the company. I understand that makes the decision much simpler.

I think there are three things that Amazon could do to improve the situation dramatically:

  • Change the T&C to at least promise to give paying customers a notice of a certain amount of days if they choose to shut the service down.
  • Publish their current uptime and availability to their customers.
  • Show you how many copies of a file exist, and how quickly a file uploaded to them becomes redundant.

Now I hope someone will compete with Amazon, and I think Microsoft and Sun are two likely candidates. Google’s main competitive advantage right now seems to be their ability to scale systems massively very quickly. Giving that away as a service doesn’t make much sense.

So, I asked about disaster recovery, and I have yet to hear anything about it. The question is still open.
