Some thoughts on the recent AWS outage (and outages in general)
1. Your most important data should exist somewhere outside AWS.
At the very least it should exist in another region. While most people have bought into the AWS model in a fairly big way, I still think we need to be prepared for a time when AWS completely disappears. For most people I suppose this means that if your database is not being replicated out of AWS, then data loss is something you may have to get comfortable with one day.
2. Power outages will happen
A lot of people are dumping on AWS right now for not having decent backup power. The thing is that it’s really hard for a hosting provider to respond to every type of event which might cause a power failure. So, you’re doing yourself a disservice if a power outage is not a scenario you can recover from. History has proven that the perfect combination of failures will happen from time to time. Even Rackspace is not immune to these things. Does anyone remember the great Texas Trucker incident of ‘07?
3. EBS is great, except when it’s not
Performance issues aside, EBS is mostly awesome. After AWS restored power yesterday our EBS volumes were frozen and inaccessible. That is a pretty special kind of frustration to experience. All your data is there but you’re waiting on a set of vague and ill-defined circumstances to occur before you can access it again. The correct response to this situation is not “wait”. Since there is an indeterminate amount of time before AWS gives you back your data, you may as well start some sort of recovery process. In our case the recovery process finished first.
4. Plan B, C, etc
The more “writes” your system handles, the less valuable EBS snapshots are as a recovery tool. Data can get stale really quickly. When Amazon restored access to EBS snapshots we were able to recover our most recent database backup and put it on a new server. However, our data was around 40 minutes old. So, I then had to go dumpster diving through our offsite MySQL replica and extract the last 40 minutes of data from the binary logs. Success!
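For anyone who has not done the bin log dance before, here is a rough sketch of how the time window can be turned into a `mysqlbinlog` command. It only builds the command line; it does not run anything. The function name, the timestamps, and the file name `mysql-bin.000042` are made up for illustration. The `--start-datetime` and `--stop-datetime` options are real `mysqlbinlog` options, but this is not the exact command we ran.

```python
import shlex
from datetime import datetime


def binlog_replay_command(snapshot_time, outage_time, binlog_files):
    """Build a mysqlbinlog argv covering the gap between backup and outage.

    All names here are hypothetical; only the mysqlbinlog options are real.
    """
    if outage_time <= snapshot_time:
        raise ValueError("outage_time must be later than snapshot_time")
    if not binlog_files:
        raise ValueError("need at least one binary log file")
    fmt = "%Y-%m-%d %H:%M:%S"  # datetime format mysqlbinlog accepts
    return [
        "mysqlbinlog",
        f"--start-datetime={snapshot_time.strftime(fmt)}",
        f"--stop-datetime={outage_time.strftime(fmt)}",
        *binlog_files,
    ]


# Placeholder times: a backup taken 40 minutes before things went dark.
cmd = binlog_replay_command(
    datetime(2000, 1, 1, 3, 0, 0),
    datetime(2000, 1, 1, 3, 40, 0),
    ["mysql-bin.000042"],
)
# Print a copy-pasteable command; its output would be piped into mysql.
print(shlex.join(cmd))
```

Datetime windows are coarse. If you know the binary log coordinates your backup was taken at, `--start-position` is the more exact way to pick the starting point.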
AWS was still busy restoring EBS volumes for many hours after we were able to restore service, so again, I think the lesson here is: don’t wait. Execute a few recovery plans and see which wins.
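The "run a few plans and see which wins" idea can be sketched in a few lines. This is a toy, not what we actually ran: the plan functions and their names are hypothetical stand-ins that just sleep, and a real plan would be a script or a person doing actual work.

```python
import concurrent.futures as cf
import time


def run_first_successful(plans):
    """Start every plan at once; return (name, result) of the first success.

    `plans` maps a label to a zero-argument callable. Plans that raise are
    recorded and skipped. Slower plans keep running in the background.
    """
    if not plans:
        raise ValueError("no recovery plans given")
    pool = cf.ThreadPoolExecutor(max_workers=len(plans))
    futures = {pool.submit(fn): name for name, fn in plans.items()}
    errors = {}
    try:
        for fut in cf.as_completed(futures):
            name = futures[fut]
            try:
                return name, fut.result()
            except Exception as exc:  # this plan failed, try the next one
                errors[name] = exc
        raise RuntimeError(f"all recovery plans failed: {errors}")
    finally:
        # Do not block on the losers; they finish (or fail) on their own.
        pool.shutdown(wait=False)


# Hypothetical plans: the sleeps stand in for hours of real work.
def wait_for_ebs():
    time.sleep(0.3)
    return "original volumes came back"


def restore_from_snapshot():
    time.sleep(0.05)
    return "new server from snapshot plus bin logs"


winner, detail = run_first_successful(
    {"wait_for_ebs": wait_for_ebs, "restore_from_snapshot": restore_from_snapshot}
)
print(winner, "->", detail)
```

The point of the sketch is that "wait" is just one of the plans, and it is allowed to lose.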
5. RDS is great, except when it’s not
All I will say is that the ability to replicate data across regions or out of AWS is a critical feature missing from this service. I still really like RDS in theory, and I go through phases of being quite enthused by it.
6. Finally, make sure you can talk to each other
Instant message services and IRC are pretty crap in emergency scenarios. If you have the option of FaceTime/Skype you will resolve things much quicker! Yesterday I had a FaceTime call going with Chippie and I have no doubt this shortened the time to recovery a lot.