Lessons from Rackspace’s downtime
Last night Rackspace Cloud had some downtime. Reading post-mortems is always instructive, so let’s see what we can learn from Rackspace.
It sounds like this downtime was caused by a power issue:
We were testing phase rotation on a Power Distribution Unit (PDU) when a short occurred and caused us to lose the PDUs behind this Cluster. The phase rotation allows us to verify synchronization of power between primary and secondary sources.
[…] The PDUs were down for a total of about 5 minutes.
This isn’t the first time Rackspace has had power-related downtime (there were a couple incidents last year), but power-related downtime is in no way something that only affects Rackspace. Indeed, a power outage in San Francisco in 2007 knocked out the 365 Main data center, taking down sites as big as Yelp and Craigslist.
Again, just to be clear: I’m not criticizing Rackspace in any way. They’re a highly regarded hosting provider, and deservedly so. From my reading, they didn’t do a single thing wrong last night — it was just one of those "shit happens" events. Indeed, they’re to be commended for providing a good post-mortem and great real-time updates as they worked through the problems.
Clearly, the lesson here is to…
Watch your power!
That awesome load-balanced, redundant, no-single-point-of-failure stack you’ve built? Yeah, doesn’t do you much good when the lights go out. In my experience, the worst, most sustained downtime has always been caused by power issues.
Because of that, I take power issues very seriously. I make a point of asking hosting providers and data centers about their power arrangements. I try to make sure my data centers have redundant power to each rack, and generators that are tested regularly. But I also assume that losing power is a matter of "when," not "if." That means making sure I’ve got good external communication channels to keep clients up-to-date, and it means making sure I can recover quickly after the lights come back on.
Speaking of lights coming back on… take a look at Rackspace’s play-by-play last night:
UPDATE: As of 1:15am CST, Rackspace Cloud engineers have identified that a Rackspace data center maintenance issue has caused residual intermittent power connectivity to our Cloud Sites system. […]
UPDATE: As of 1:30am CST, service has been restored to the majority of our technology clusters in our WC2 cluster. […]
UPDATE: As of 2:00am CST, we are still trouble shooting […]
UPDATE: As of 2:30am CST, […]
UPDATE: As of 4:20am CST, we are seeing a few residual issues […]
UPDATE: As of 4:27am CST, engineers corrected the outstanding residual issues and have once again fully restored service.
Hang on: I thought the post-mortem said that the PDUs were only down for five minutes? How did a five-minute power outage turn into over three hours of downtime (for some customers)?
This leads us to a corollary to this lesson:
Reboots must be automatic
If you accept the argument that power-related downtime is inevitable, then it stands to reason that you’ve got to be absolutely sure that your machines boot cleanly, and that all required services come up when they do.
I know, I know, we all like to brag about our uptime. Unfortunately, the only way to know that your machines boot correctly is to try it. It’s time to make a high uptime a mark of shame, not a badge of honor.
Good data centers fail over to backup power every week or so; it’s a similarly good practice to reboot machines regularly. Forcing a sudden reboot is also a clean test of your failover solution (you’ve got failover mechanisms for your machines, right?). Plus, you’ll know that a five-minute power outage won’t turn into three hours of debugging.
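To make "all required services come up" something you can check rather than hope for, here is a minimal sketch of a post-reboot smoke test. It only tries to open a TCP connection to each service, which is a weak but cheap signal. The service names, hosts, and ports in REQUIRED_SERVICES are hypothetical placeholders, not anything from Rackspace's setup; swap in whatever your own stack needs.

```python
import socket

# Hypothetical example services; replace with what your stack actually runs.
REQUIRED_SERVICES = {
    "ssh": ("127.0.0.1", 22),
    "http": ("127.0.0.1", 80),
    "postgres": ("127.0.0.1", 5432),
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failed_services(services, timeout=2.0):
    """Return the sorted names of services that did not accept a connection."""
    return sorted(
        name
        for name, (host, port) in services.items()
        if not port_is_open(host, port, timeout)
    )


def main():
    failed = failed_services(REQUIRED_SERVICES)
    if failed:
        print("Services not answering after boot: " + ", ".join(failed))
    else:
        print("All required services are answering.")


if __name__ == "__main__":
    main()
```

Run something like this from a monitoring host after every scheduled reboot. An open port doesn't prove the service is healthy, so treat it as a first check, not the last one.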
Finally, one last corollary:
Five nines is impossible
Really. It’s just not going to happen. If your SLA guarantees five nines you might as well set aside the penalty money from day one; you’ll be paying out sooner or later.
Think about it: 99.999% uptime translates to only 26 seconds of downtime in a month. Even if your machines boot perfectly every time, they’ll never recover from even a split second of power loss within that budget. No cluster I’ve seen can boot in under 26 seconds.
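If you want to check the arithmetic, here is a small sketch that turns an availability percentage into a downtime budget. It assumes a 30-day month, and the function name is just something I made up for illustration.

```python
def downtime_budget_seconds(availability_percent, period_days=30.0):
    """Seconds of allowed downtime in a period at the given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(availability)
        print(f"{availability}% uptime allows {budget:.1f} seconds down per 30 days")
```

For five nines that works out to roughly 26 seconds a month, and each nine you drop multiplies the budget by ten.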
You will lose power, so be prepared.