Last night Rackspace Cloud had some downtime. Reading post-mortems is always instructive, so let’s see what we can learn from Rackspace.
It sounds like this downtime was caused by a power issue:
We were testing phase rotation on a Power Distribution Unit (PDU) when a short occurred and caused us to lose the PDUs behind this Cluster. The phase rotation allows us to verify synchronization of power between primary and secondary sources.
[…] The PDUs were down for a total of about 5 minutes.
This isn’t the first time Rackspace has had power-related downtime (there were a couple incidents last year), but power-related downtime is in no way something that only affects Rackspace. Indeed, a power outage in San Francisco in 2007 knocked out the 365 Main data center, taking down sites as big as Yelp and Craigslist.
Again, just to be clear: I’m not criticizing Rackspace in any way. They’re a highly regarded hosting provider, and deservedly so. From my reading, they didn’t do a single thing wrong last night — it was just one of those “shit happens” events. Indeed, they’re to be commended for providing a good post-mortem and great real-time updates as they worked through the problems.
Clearly, the lesson here is to…
Watch your power!
That awesome load-balanced, redundant, no-single-point-of-failure stack you’ve built? Yeah, doesn’t do you much good when the lights go out. In my experience, the worst, most sustained downtime has always been caused by power issues.
Because of that, I take power issues very seriously. I make a point of asking hosting providers and data centers about their power arrangements. I try to make sure my data centers have redundant power to each rack, and generators that are tested regularly. But I also assume that losing power is a matter of “when,” not “if.” That means making sure I’ve got good external communication channels to keep clients up-to-date, and it means making sure I can recover quickly after the lights come back on.
Speaking of lights coming back on… take a look at Rackspace’s play-by-play last night:
UPDATE: As of 1:15am CST, Rackspace Cloud engineers have identified that a Rackspace data center maintenance issue has caused residual intermittent power connectivity to our Cloud Sites system. […]
UPDATE: As of 1:30am CST, service has been restored to the majority of our technology clusters in our WC2 cluster. […]
UPDATE: As of 2:00am CST, we are still trouble shooting […]
UPDATE: As of 2:30am CST, […]
UPDATE: As of 4:20am CST, we are seeing a few residual issues […]
UPDATE: As of 4:27am CST, engineers corrected the outstanding residual issues and have once again fully restored service.
Hang on - I thought the post-mortem said that the PDUs were only down for 5 minutes? How’d a five-minute power outage turn into over 3 hours of downtime (for some customers)?
This leads us to a corollary to this lesson:
Reboots must be automatic
If you accept the argument that power-related downtime is inevitable, then it stands to reason that you’ve got to be absolutely sure that your machines boot cleanly, and that all required services come up when they do.
I know, I know, we all like to brag about our uptime. Unfortunately, the only way to know that your machines boot correctly is to try it. It’s time to make a high uptime a mark of shame, not a badge of honor.
Good data centers fail over to backup power every week or so; it’s a similarly good practice to regularly reboot machines. Forcing a sudden reboot will cleanly test your failover solution (you’ve got failover mechanisms for your machines, right?). Plus, you’ll know that a five-minute power outage won’t turn into 3 hours of debugging.
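As a rough sketch of what "all required services come up" could look like in practice, here is a small Python check you might run after a test reboot. The service names, hosts, and ports in REQUIRED_SERVICES are hypothetical placeholders, and a successful TCP connection is only a crude stand-in for a real health check.

```python
import socket

# Hypothetical service list: names, hosts, and ports are illustrative only.
REQUIRED_SERVICES = [
    ("web", "127.0.0.1", 80),
    ("database", "127.0.0.1", 5432),
]


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the service as down.
        return False


def missing_services(services):
    """Return the names of services that are not accepting connections."""
    return [name for name, host, port in services if not is_listening(host, port)]


if __name__ == "__main__":
    down = missing_services(REQUIRED_SERVICES)
    if down:
        print("Not up after reboot:", ", ".join(down))
    else:
        print("All required services are up.")
```

Run from cron or a monitoring box shortly after each scheduled reboot, something like this tells you about a service that failed to start before a real outage does.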
Finally, one last corollary:
Five nines is impossible
Really. It’s just not going to happen. If your SLA guarantees five nines you might as well set aside the penalty money from day one; you’ll be paying out sooner or later.
Think about it: 99.999% uptime translates to only 26 seconds of downtime in a month. Even if your machines boot perfectly every time, they’ll never recover from even a split second of power loss within that budget. No cluster I’ve seen can boot in under 26 seconds.
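The arithmetic behind that figure is easy to check. This is a minimal sketch that assumes a 30-day month; the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability_percent, days=30):
    """Return the downtime allowed, in seconds, over a period of `days` days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * SECONDS_PER_DAY


# Compare the monthly budget for three, four, and five nines.
for label, percent in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    budget = downtime_budget_seconds(percent)
    print(f"{label} ({percent}%): {budget:.1f} seconds per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why five nines leaves you with less time than a single reboot takes.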
Remember:
You will lose power, so be prepared.
Comments:
Don't know. Some providers offer more than 5 nines. And that's not unreasonable. A lot of them actually test their equipment on a weekly basis. That's the minimum.
Generally -- on server reboots, and re: your question, "How’d a five minute power outage turn into over 3 hours of downtime (for some customers)?"
This leads to other issues.
E.g. cheaper hardware might not survive the loss. I've seen it before. Broken HDD, doesn't matter if your services "would" come up, or not. Not suggesting RackSpace has this issue, but generally speaking.
Then, think of corrupted volumes, etc. Never hurts to buy an APC and plug it in between, so you get a "controlled" shutdown sequence when the upstream dies. Because have you ever rebuilt a volume with a couple TB? Not so much fun, and it's also not very fast.
Power Corrupts.
Great stuff, Jacob.
I ran my production servers on Rackspace during that power outage in 2007. We were down for nearly 24 hours, and it was the most frustrating, nerve-racking time in my career because I thought it could never happen to me and was completely blindsided.
I toured the DFW facility, saw the millions of dollars in redundancy preparation and talked to very, very smart people about their DR plans. I never planned that it could possibly happen to me, and not being prepared was the worst part of it all. Even a static "Oops, something's wrong" page would have been better than the blank page we presented. (Yes, after we got our heads on straight and figured out what was going on we did this, but the point is we should have been prepared for this contingency ahead of time)
When I plan enterprise systems now, I always keep this in mind the way a test pilot makes sure his or her parachute is ready if the worst should happen.
I suspect this is another reason why Google goes with individual batteries on their systems and DC power. DC is far, far easier to work with than AC when wiring up the datacenter and ensuring things on multiple circuits work together. You perform the conversion any number of times and just wire up in parallel and you're set.
Google seem to use DC backup only on each rack, they use a high efficiency power supply on each system to go from AC 120 to DC 12v
I'm not sure you can have everything automatically reboot in a datacenter following a massive outage. Wouldn't the power surge after the power is restored trip more breakers and cause the process to start all over again? I know you have to power on servers within a rack a few seconds apart to avoid overloading the circuit for that rack.
I love this post. Power is usually the last thing people think of when building their infrastructure or populating a rack. As someone who used to run a commercial data center, power was my nemesis. As we grew, we were constantly dealing with power issues big and small. Colocation customers never understood power. I also learned a ton. Most notably I learned the closing statement "you will lose power". You will also lose network connectivity, SAN connectivity, and your star engineer. You can't stop any of those things, you just have to prepare for them.
The quotes in entry titles are being double-escaped in your feed, so quote characters show up as character entity references.
PDUs can blow out. Any setup advertising redundancy without actually handling a PDU failure isn't redundant imo. This sounds like it was a bit further up the chain from what I'm familiar with though and at some point everything comes to a head.
When I build out a cabinet for redundant power I do make a point of asking for two different phases that I connect to two different PDUs in the cabinet. Also phase utilization at this point is important. You want one of the phases to be able to completely take over should the other one blow out. Luckily there are nice big displays of phase utilization on the way in to the facility where I keep my stuff co-located.
If you have redundant PSUs and can't connect them to different PDUs don't even bother imo. You just increase the risk of a short taking out both PSUs.
2000 days, 5 days ago btw. shame on me! :(
Rackspace suffered a real change when it went public last year. There was a cultural shift in which their entire value structure changed, which is reflected in the not-quite-fanatical support and barracuda sales efforts. It's frustrating to see an A-list host stumbling like this.
S**t happens!!
Just because Rackspace is horrible and can't sort out their datacenter issues doesn't mean that five nines is impossible. I've even had 100% uptime (from the end-user's point-of-view) on my setup in a single datacenter which has 2(n+1) redundancy and everything is setup to be redundant on my cluster.