No matter what you do, your power will fail someday. Here are four steps you can take to prepare yourself.

It's a calm, sunny weekend day without a cloud in the sky. The barbecue is lit and beers have been cracked open. Things couldn't get any better. But lurking somewhere in the power grid is a faulty component that's been hanging by a thread for weeks. And it has picked today to be its last. Power to the data center is abruptly cut. Uninterruptible power supplies assume the load in anticipation of the backup generator startup, a startup that never comes thanks to a tripped circuit breaker. A few minutes later, the data center plunges into darkness. Burgers and beers will have to wait. There's work to be done.

This is a scenario I've seen play out in strikingly similar ways about once a year. The first I can remember was in a colocation center in downtown Los Angeles near the height of the dot-com boom. The most recent was only a few days ago, on the morning of July 4. In the first case, a sizable office building containing three subbasements' worth of data center gear was unceremoniously brought down, despite the presence of an enormous facilitywide battery-backup system, three mutually redundant backup generators (each large enough to power a small town), and path-diverse access to two separate commercial power grids.

The exact reasons behind these outages aside, it's clear that no matter how much capital you've invested in your data center infrastructure (or in a state-of-the-art colo), you're bound to lose power someday. Yet very few of us actually go to the trouble of testing our systems' response to an unexpected power outage. In my experience, the larger the data center or organization, the less likely it is to have deliberately tested a power outage. Unfortunately, these same large organizations have the infrastructural complexity that almost guarantees continued trouble even after power is restored.

As anyone who regularly reads my blog knows, I'm a huge proponent of testing everything, both during planned downtime and in the midst of production. However, I'm not so naive as to believe that everyone has the resources, or the all-important managerial backing, to do this kind of testing. Most of us simply have to prepare diligently, wait for the "big one" to hit, then deal with it as best we can. To that end, you can take a few steps ahead of a large-scale disaster, whether or not it's power-related.

Step 1: Plan external access

Murphy's Law dictates that if you're going to have a large, data-center-wide disaster, it won't occur when you're ready and on site. In all but one instance I've seen, data center power outages occurred at night or on a weekend or holiday, when a full-strength engineering staff had to be called in from home to stand the site back up. Unless your entire staff lives two minutes from the data center, cutting that automatic RTO (return to operations) penalty out of the mix can be a huge benefit. However, providing remote access to a data center that may be completely dark, then actually making that access useful, isn't as easy as it might seem. It typically requires a completely separate management network with its own significantly oversized battery-backed power (designed to provide runtime in hours rather than minutes) and its own dedicated Internet access. For more on that, check out one of my previous articles and another by my colleague Paul Venezia.
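One thing worth automating here: a routine check that the out-of-band path still works, because dedicated management networks have a way of quietly rotting when nobody is using them. Below is a minimal sketch in Python, using only the standard library; every hostname and port in it is a hypothetical placeholder for your own management VPN endpoint, console servers, and switched PDUs.

```python
#!/usr/bin/env python3
"""Verify the out-of-band management path before you need it.

All hostnames and ports are placeholders -- substitute your own
OOB VPN endpoint, console servers, and management-network PDUs.
"""
import socket
import sys

# (host, port) pairs that should answer on the dedicated management network.
OOB_TARGETS = [
    ("oob-vpn.example.com", 443),    # VPN concentrator on the dedicated uplink
    ("console1.example.com", 22),    # serial console server
    ("mgmt-pdu1.example.com", 443),  # switched PDU web interface
]

def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> int:
    failures = [(h, p) for h, p in OOB_TARGETS if not reachable(h, p)]
    for host, port in failures:
        print(f"UNREACHABLE: {host}:{port}", file=sys.stderr)
    return 1 if failures else 0  # nonzero exit lets cron or monitoring alert

if __name__ == "__main__":
    sys.exit(main())
```

Run something like this on a schedule from outside the facility (a cloud instance, or even an engineer's home machine), so a broken backup path can't hide behind a working production path.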
Step 2: Chart dependencies

The next most important measure is to build a dependency tree that includes all of your major infrastructural components (and applications, if you can). This tree should show the order in which your systems should be returned to production. Most important, it will point out situations where you have a circular dependency.

Circular dependencies are often the worst, most time-consuming issues to resolve in the aftermath of an outage. For example, in the data center blackout that took place last week, a critical infrastructure service depended on a database server, and that database server in turn couldn't be reached on the network until the same infrastructure service was running. Under normal operating conditions this worked fine, but in a power-restoration scenario the tangle had to be pulled apart and fixed, costing an extra hour or so of downtime. Simply charting the dependencies ahead of time would have saved a lot of head-scratching and missed barbecue time.

Step 3: Introduce power management hardware and software

Once you have a solid idea of the order in which your systems should be returned to production, organize your data center equipment and software so that it returns to operation in the correct sequence automatically. In some cases, this will involve implementing intelligent power distribution hardware that can power on individual outlets with predetermined delays between each step. In others, you may be scripting power-on sequences for virtual machines or simply organizing virtual machines into vApps (in VMware parlance) so that they start in a predetermined order. This work need not encompass the entire data center; you may decide to focus on the most basic systems and manually sort out the rest.

The most common order-of-operations issue I see involves the SAN and the virtualization cluster. In most cases, virtualization hosts boot faster than enterprise-class SAN hardware, especially large SANs. This can lead to a scenario where the virtualization hosts power up and fail to automatically restart virtual machines because storage isn't yet ready. Simply delaying the startup of those hosts until the SAN is open for business can mean the difference between a largely automatic restart and a lengthy process requiring a lot of manual intervention. (A sketch that ties this step together with the dependency chart from Step 2 appears after Step 4.)

Step 4: Learn from what you missed

Last, when a disaster strikes and puts your preparation to the test, try to take notes during the failure. What worked properly? What didn't? Although you might not have the freedom to plan downtime for a scripted test, do not overlook the value of facing the real thing and having a chance to learn from it. After all, in the wake of a significant failure and its unplanned downtime, purse strings will typically be looser, and you may be able to implement preventive measures that management had previously declined to fund.

Whatever you do, do not make the mistake of assuming that because you have a generator and redundant UPS capacity, you'll be immune to a darkened data center. You simply aren't. Given enough time, the holes in your layers of redundancy will line up perfectly and the dominoes will fall. Do your best to be prepared to bring the data center back up from scratch, and to learn from the experience. You won't regret it.
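Steps 2 and 3 lend themselves to a little automation, and the sketch below combines both ideas in Python. The dependency chart is a small graph, the standard library's graphlib (Python 3.9+) flags any circular dependency before you're debugging it in the dark, and each service's startup is gated on its prerequisites actually answering on the network, so virtualization hosts don't boot until the SAN responds. The service names, ports, and the power_on() hook are all hypothetical placeholders, not a real inventory or API.

```python
#!/usr/bin/env python3
"""Dependency-ordered data center restart -- a sketch, not a product.

Service names, ports, and the power_on() hook are placeholders;
substitute your own inventory and your PDU/IPMI/hypervisor tooling.
Requires Python 3.9+ for graphlib.
"""
import socket
import time
from graphlib import CycleError, TopologicalSorter

# Step 2: the dependency chart. Each service maps to the set of
# services that must be up before it can start.
DEPENDS_ON = {
    "core-switch": set(),
    "san":         {"core-switch"},
    "dns":         {"core-switch"},
    "virt-hosts":  {"san", "dns"},   # don't boot hosts before storage
    "db-server":   {"virt-hosts"},
    "app-tier":    {"db-server", "dns"},
}

# A TCP port that indicates each service is actually ready, not just powered.
READINESS_PROBE = {
    "core-switch": ("core-sw1.example.com", 22),
    "san":         ("san1.example.com", 3260),   # iSCSI
    "dns":         ("dns1.example.com", 53),
    "virt-hosts":  ("esx1.example.com", 443),
    "db-server":   ("db1.example.com", 5432),
    "app-tier":    ("app1.example.com", 443),
}

def power_on(service: str) -> None:
    """Placeholder: call your switched PDU, IPMI, or hypervisor API here."""
    print(f"powering on {service} ...")

def wait_until_ready(service: str, timeout: float = 900.0) -> None:
    """Poll the service's readiness port until it answers or we give up."""
    host, port = READINESS_PROBE[service]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5.0):
                print(f"{service} is ready")
                return
        except OSError:
            time.sleep(15)  # not up yet; poll again
    raise TimeoutError(f"{service} never became ready -- intervene manually")

def main() -> None:
    sorter = TopologicalSorter(DEPENDS_ON)
    try:
        order = list(sorter.static_order())  # raises CycleError on circular deps
    except CycleError as err:
        # Better to discover this on a quiet afternoon than after a blackout.
        raise SystemExit(f"circular dependency detected: {err.args[1]}")
    for service in order:
        power_on(service)
        wait_until_ready(service)

if __name__ == "__main__":
    main()
```

Even if you never run a script like this in anger, simply writing out the DEPENDS_ON table forces the circular-dependency conversation to happen before the outage rather than during it.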
This article, “When the data center goes down: Preparing for the big one,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.