Matt Prigge
Contributing Editor

CYA: 3 rules to keep a systems crisis from taking down IT

analysis
Oct 15, 2012 | 5 mins

Even the most redundant of infrastructures can be brought down by a lack of readily accessible knowledge

These days, all but the smallest organizations spend mountains of money building redundancy into their infrastructures. As business depends more and more on those systems to function at even the most basic levels, the capital plowed into highly redundant disk arrays, bulletproof backups, and highly available virtualization infrastructures has become an expected cost of doing business.

However, the frenetic pace of break/fix, application rollouts, and systems upgrades often leads to the most dangerous single points of failure of all: people. That huge investment in redundant, self-healing infrastructure can be negated in one fell swoop if the one person who knows how to run some critical part of it quits, is on vacation, or simply went out to lunch without a cellphone at just the wrong time.

Often, you don’t need an actual service disruption to cause a five-alarm political fire that reaches all the way to the executive suite. In my years as a consultant, I can’t count the number of times I’ve been called in (sometimes at a substantial expense) to help fill a knowledge gap simply because a single member of the IT staff wasn’t available for whatever reason.

Looking back on those incidents, there are three items every member of the infrastructure team should have at their fingertips, whatever their role.

1. Build basic documentation

If you’ve built and been responsible for a critical infrastructure system — storage, virtualization, or a complex application — you probably can recite every detail about it, IP addresses and all, without looking at a single piece of paper. But how much of that information does the guy or gal in the office next to you know? Often, very little — sometimes, nothing at all.

If there’s one thing most IT pros really hate doing, it’s writing documentation (I count myself in that group). However, even the most basic tabular documentation that shows which physical systems are responsible for which applications, what their names are, and which rack they are in can make a huge difference when something goes wrong while you’re not around. I’ve seen massive damage done in the course of a lunch break simply because an uninformed staff member tried to troubleshoot the wrong system for lack of the most basic shred of current documentation.

Having basic documentation may not allow an unskilled member of the team to fix a complex problem, but it will allow him or her to zero in on where a reported problem might be originating and double-check some of the most obvious causes. Critically, it short-circuits the often politically damaging scramble to answer even the most basic questions from users and stakeholders.
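To make that concrete, here's a rough sketch of the kind of tabular inventory I'm talking about, kept as a small script anyone can rerun and print. Every hostname, IP address, application, and rack location below is a made-up placeholder; substitute your own.

# Sketch of a basic system inventory; all hosts, IPs, and rack locations are placeholders.
import csv

SYSTEMS = [
    # (hostname, application, IP address, rack location)
    ("db-prod-01",  "ERP database (primary)",   "10.0.10.21", "Rack A4, U12-14"),
    ("db-prod-02",  "ERP database (replica)",   "10.0.10.22", "Rack B2, U12-14"),
    ("mail-01",     "Mail server",              "10.0.20.15", "Rack A1, U20"),
    ("san-ctrl-01", "Storage array controller", "10.0.30.5",  "Rack C3, U1-4"),
]

def write_inventory(path="system-inventory.csv"):
    """Write the inventory to a CSV that anyone on the team can open and print."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Hostname", "Application", "IP address", "Rack location"])
        writer.writerows(SYSTEMS)

if __name__ == "__main__":
    write_inventory()
    # Also dump a quick plain-text version to the console.
    for host, app, ip, rack in SYSTEMS:
        print(f"{host:<12} {ip:<12} {rack:<16} {app}")

The format matters far less than the habit: keep it current, and keep it somewhere the whole team can find it without asking you.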

2. Track support contacts

Another source of huge problems when a primary admin is unavailable is that the rest of the team may not know where to escalate a problem. If a disk dies in your storage array while you’re on vacation, does the rest of the team know who to call to get a new one? Is there a support contract number they need to know? What if a WAN circuit goes down? Will everyone know which telco provider services that line and who to call?

That kind of backstop sounds incredibly simple, and it absolutely is. But if the information isn’t widely available (say, as a printout on a bulletin board in the data center), its absence can delay the fix and sometimes even cause a general panic around a problem nobody knows how to solve.
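Here's a sketch of what that printout might be generated from, again with every vendor name, phone number, and contract number invented purely for illustration.

# Sketch of an escalation/contact sheet, kept alongside the system inventory.
# Vendors, contract numbers, and phone numbers below are placeholders, not real data.
ESCALATION_CONTACTS = [
    {
        "system": "Storage array (san-ctrl-01)",
        "vendor": "Example Storage Co.",
        "support_phone": "+1-800-555-0100",
        "contract_number": "SC-000000",
        "notes": "Have the array serial number ready before calling.",
    },
    {
        "system": "Primary WAN circuit (HQ to colo)",
        "vendor": "Example Telco",
        "support_phone": "+1-800-555-0199",
        "contract_number": "CIRCUIT-000000",
        "notes": "Reference the circuit ID, not the account number.",
    },
]

if __name__ == "__main__":
    # Print a plain-text sheet suitable for pinning to the data center bulletin board.
    for entry in ESCALATION_CONTACTS:
        print(entry["system"])
        print(f"  Vendor:   {entry['vendor']}")
        print(f"  Phone:    {entry['support_phone']}")
        print(f"  Contract: {entry['contract_number']}")
        print(f"  Notes:    {entry['notes']}")
        print()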

3. Back up and restore

Everything I’ve mentioned so far boils down to having the right information easily available to everyone who might need it. But there are two hands-on activities that most of the IT team should be fully versed in doing themselves: ensuring that backups are being made and being able to restore them. Backing up and restoring are simply too important to depend on one or even two people to do.
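Verifying that backups are actually landing doesn't require deep product expertise. Here's a rough, product-agnostic sketch; the backup path, the 26-hour threshold, and the assumption that fresh backups show up as recently modified files are all placeholders you'd adapt to your own backup software and schedule.

# Sketch of a backup-freshness check anyone on the team can run.
# BACKUP_DIR and MAX_AGE_HOURS are assumptions; adjust to your environment.
import os
import sys
import time

BACKUP_DIR = "/mnt/backups/nightly"   # hypothetical backup target
MAX_AGE_HOURS = 26                    # nightly job plus a little slack

def newest_backup_age_hours(path):
    """Return the age in hours of the most recently modified file under path, or None if empty."""
    newest = 0.0
    for root, _dirs, files in os.walk(path):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    if newest == 0.0:
        return None
    return (time.time() - newest) / 3600.0

if __name__ == "__main__":
    age = newest_backup_age_hours(BACKUP_DIR)
    if age is None:
        print(f"WARNING: no backup files found under {BACKUP_DIR}")
        sys.exit(1)
    if age > MAX_AGE_HOURS:
        print(f"WARNING: newest backup is {age:.1f} hours old (limit {MAX_AGE_HOURS})")
        sys.exit(1)
    print(f"OK: newest backup is {age:.1f} hours old")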

The danger of not having broad backup and restore skills in IT goes beyond the need to restore during a disaster; it can hurt the business and IT during routine work as well. I’ve witnessed several such nasty surprises: restoring the wrong data and overwriting production data with outdated or test data, failing to include a new production system in the backup rotation (leaving it unprotected even months later), and simply giving the CEO the impression that nobody knows how to restore his email.

Backup operations should be a rotating task so that it becomes everyone's responsibility at one point or another. Such rotation ensures that everyone is truly comfortable performing restores and that someone is likely to notice failures or systems that are out of compliance.
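Even the rotation itself can be written down rather than remembered. A trivial sketch, with a placeholder roster, that maps each ISO week to the admin on backup duty:

# Sketch of a rotating backup-duty assignment; the roster names are placeholders.
import datetime

TEAM = ["alice", "bob", "carol", "dave"]   # hypothetical admin roster

def backup_duty(date=None):
    """Return who owns backup checks and test restores for the given ISO week."""
    date = date or datetime.date.today()
    week = date.isocalendar()[1]
    return TEAM[week % len(TEAM)]

if __name__ == "__main__":
    print(f"Backup duty this week: {backup_duty()}")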

Putting it all together

In a perfect world, every IT team would include primary, secondary, and tertiary admins who are fully trained on every aspect of the infrastructure and can cover each other as needed. However, at the rate things change in IT and within IT departments, this goal is often nothing more than a pipe dream.

Still, that kind of coverage is a worthy goal to strive for. Even if you can’t accomplish it, you can at the very least make sure you’re leaving the necessary documentation breadcrumbs for other admins to pick up for you while you’re gone and that enough people are thoroughly cross-trained on the most critical tasks.

This article, “CYA: 3 rules to keep a systems crisis from taking down IT,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.