Matt Prigge
Contributing Editor

Getting out of jail free

analysis
Jan 9, 2012 · 8 mins

Innocuous details can bring a network to its knees, so expect the worst and prepare for widespread outages before they happen

It’s 1:30 in the morning. By some miracle, you were able to get approval for a four-hour downtime window to complete a long list of overdue patching and network maintenance. Even better, you’re done a half-hour early. Life is good!

As you’re about to email the third shift to let them know they can get back in ahead of schedule, you remember it: that one setting you always knew was wrong and wanted to fix — and, you thought, shouldn’t cause any service disruption — so you never got around to correcting it. Little do you know that “fix” is going to be your undoing.

It doesn’t matter what it is. For me, it’s been an incorrectly set spanning tree bridge priority or UPS software configured with an inadequate shutdown delay. Either way, half a second after hitting Enter or clicking Apply, your terminal freezes, pings go unanswered, and panic sets in: You’ve brought down the entire network, and you have no idea why. You thought you finished 30 minutes ahead of schedule, but now that half-hour may not be enough time to run around with a laptop and console cable to figure out what happened, much less fix it.

You can’t avoid situations like this all the time. Bad things happen when you least expect them, and the old adage “If it ain’t broke, don’t fix it” applies to IT as much as it does to any other field. Nonetheless, you can build safeguards into your network that will drastically reduce the time it takes to fix problems when they arise.

Out-of-band management

Look around a modern data center and you’ll find an overlooked resource in abundance: out-of-band management ports. These days, you can hardly buy any enterprise network, storage, or server hardware that lacks some kind of out-of-band management capability. These ports generally attach to dedicated processors or isolated IP stacks that remain available even if the device they’re attached to falls on its face due to misconfiguration or a hardware problem, providing an easy way to get into the system and determine if it’s the cause of an outage or just a victim.

All too often, these out-of-band management ports aren’t properly leveraged; either they’re never plugged in or fully configured, or they aren’t attached to the very production network that depends upon them. Usually it’s the result of simply not having enough time to configure them when new systems are installed. After all, you can always plug in that management port or lights-out controller later; getting that server configured so that management will get off your back always seems more important.

Though you’re unlikely to use these out-of-band management ports every day, I can’t overstate how useful they are during a widespread network outage, especially if you’re troubleshooting it remotely.

Implementing a management network

The trick to building a solid management network is to construct it so that it has absolutely no dependencies on any mission-critical services you’re tasked with maintaining. To that end, it should have its own network switches, its own power protection, and maybe even its own firewall and Internet connection.

That might sound like massive overkill, but trust me, it’s anything but. Maintaining absolute separation from the production network is a great way to ensure security, and it also protects the management network from the very production failures it will have to diagnose and remediate. It also needn’t be expensive: I don’t know a data center admin who doesn’t have a stack of old switching gear sitting on a shelf. Nor does this stuff have to be fast; it just has to work. In most cases, setting up one of these networks will just cost you some time.

The first thing to do is fire up one of those old 10/100 switches that you pulled out of a rack last year. From there, attach all of your data center’s out-of-band management ports to it. You’ll want to address those management ports from a subnet that’s entirely separate from your production IP addressing scheme (you won’t be routing the two together).
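One way to sanity-check the addressing plan is to confirm that the candidate management subnet overlaps nothing in production. The following is a minimal sketch using Python’s standard `ipaddress` module; the function name and every subnet shown are made-up examples, not values from any real site.

```python
import ipaddress

def overlapping_subnets(mgmt_cidr, production_cidrs):
    """Return the production subnets that overlap the management subnet."""
    mgmt = ipaddress.ip_network(mgmt_cidr)
    return [
        cidr for cidr in production_cidrs
        if mgmt.overlaps(ipaddress.ip_network(cidr))
    ]

# Hypothetical production addressing plan, for illustration only.
PRODUCTION = ["10.0.0.0/16", "10.1.0.0/16", "192.168.10.0/24"]

if __name__ == "__main__":
    for candidate in ("10.1.50.0/24", "172.31.250.0/24"):
        clashes = overlapping_subnets(candidate, PRODUCTION)
        verdict = "overlaps " + ", ".join(clashes) if clashes else "is separate"
        print(f"{candidate} {verdict}")
```

An empty result means the candidate range is safe to hand out to management ports; anything else is a subnet to avoid.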

Next, you’ll want to give yourself a good way to get onto your management network. There are several ways to do this, but they all have security and availability trade-offs. You might dual-home a management server or workstation on both your production and management networks. Often, any network monitoring server/software you have will be a good choice, especially since you can configure your software to monitor the devices over both networks, which can provide useful forensic data when troubleshooting short-lived outages. That way, you have a single server you can remote into from the production network to gain access to the management network. In an emergency, you can get to the console of that machine and easily diagnose problems.
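To illustrate the idea of watching a device over both networks from a dual-homed monitoring host, here is a rough sketch that tries a TCP connection to the same device on its production address and on its management address. The inventory format, the device name, the addresses, and the choice of port 22 are all assumptions made for the example; a real monitoring package would have its own way of expressing this.

```python
import socket
import time

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_both_paths(device, probe=tcp_reachable, now=time.time):
    """Probe one device over its production and management addresses."""
    return {
        "name": device["name"],
        "timestamp": now(),
        "production_up": probe(*device["production"]),
        "management_up": probe(*device["management"]),
    }

# Hypothetical inventory entry: (address, TCP port) for each path.
CORE_SWITCH = {
    "name": "core-sw1",
    "production": ("10.0.0.2", 22),
    "management": ("172.31.250.2", 22),
}
```

Calling `probe_both_paths(CORE_SWITCH)` on a schedule and logging each result gives you a timeline in which a device that vanished from production but stayed up on the management side stands out clearly.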

If you want to be able to troubleshoot network-down emergencies remotely, a great way to do it is to add a secondary firewall that serves only the management network. It doesn’t need to be anything special; one you outgrew or pulled from a closed remote office will do. Best case, you’d attach that firewall to a completely separate Internet connection, such as a low-cost subscriber-grade DSL or cable line. As with the firewall and network gear, it need not be fast. You just have to be able to fire up a VPN and run SSH/RDP over it.

Of course, all of your mission-critical hardware might not come equipped with an out-of-band Ethernet port. Instead, you may be stuck with a serial console port as your only out-of-band management capability. In those cases, consider buying (or recycling) a serial terminal server and making those console ports available on your management network. It might not be sexy, but it beats rushing into the data center with a laptop and serial cable — or worse, driving into the office in the middle of the night.

Throughout all of this work, don’t forget to be mindful of the security implications of what you’re doing. While it’s true that you may be improving security by segregating the soft underbellies of your network devices from the network at large, be sure you don’t simultaneously expose them by implementing overly permissive rules on your management network firewall or by using weak passwords.
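As a rough illustration of checking for overly permissive rules, the sketch below flags any inbound allow rule on the management firewall whose source range is broader than a chosen prefix length. The rule format, the sample rules, and the /24 cutoff are all hypothetical; real firewalls export their rule sets in their own formats, and the right cutoff depends on where you connect from.

```python
import ipaddress

def risky_rules(rules, min_prefix=24):
    """Return allow rules whose source network is broader than min_prefix."""
    flagged = []
    for rule in rules:
        source = ipaddress.ip_network(rule["source"])
        if rule["action"] == "allow" and source.prefixlen < min_prefix:
            flagged.append(rule)
    return flagged

# Hypothetical rule set; addresses come from documentation ranges.
RULES = [
    {"action": "allow", "source": "203.0.113.16/28", "service": "vpn"},
    {"action": "allow", "source": "0.0.0.0/0", "service": "ssh"},
    {"action": "deny", "source": "0.0.0.0/0", "service": "any"},
]

if __name__ == "__main__":
    for rule in risky_rules(RULES):
        print("review:", rule)
```

Here only the SSH rule open to the whole Internet is flagged; the narrowly scoped VPN rule and the catch-all deny pass.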

What you get

The benefits of having an isolated management network are huge. You can remotely tell the difference between a power failure, an Internet connection failure, or an internal network outage without so much as picking up a phone. Better yet, you can access and remotely diagnose all manner of hardware or configuration problems.
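The triage described above can be sketched as a simple decision function fed by observations gathered over the management network. The three inputs and the order in which they are checked are simplifying assumptions for illustration, not a complete diagnostic procedure.

```python
def triage(ups_on_battery, internal_hosts_up, production_internet_up):
    """Classify an outage from three observations made via the management network."""
    if ups_on_battery:
        # A UPS running on battery points to a power problem first.
        return "power failure"
    if not internal_hosts_up:
        # Power is fine but production devices do not answer each other.
        return "internal network outage"
    if not production_internet_up:
        # Inside is healthy; only the production uplink is down.
        return "Internet connection failure"
    return "no fault detected"

if __name__ == "__main__":
    print(triage(ups_on_battery=False,
                 internal_hosts_up=True,
                 production_internet_up=False))
```

Checking power first is a deliberate choice in this sketch, since a power failure usually takes the other two down with it.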

Perhaps the best example of where this capability can save your bacon is one from my own past. Many years ago, when virtualization had just started to enter the consciousness of trade mags, I was wrapping up the implementation of a virtualization environment for a client. It was the middle of the night, and I was sitting at home in my PJs running through the project completion checklist. The last thing on my list was to double-check the configuration of the UPS auto-shutdown software. As I did so, I noticed that the software was configured to shut the systems down when two minutes of battery were left. Since that was insufficient, I bumped it up to five. Then I was promptly disconnected.

As it turned out, the entire production network (core switches, virtualization hosts, you name it) had just been gracefully shut down because the UPS that the management software was connected to had only 4 minutes, 38 seconds of runtime. Though the management network wasn’t served by its own Internet connection or firewall in this case (the site only had T1 connectivity, so DSL/cable wasn’t an option), the firewall, management network hardware, and Internet hardware were on their own UPS. After a bit of head scratching, I was able to get back into the network and remotely command the UPS to turn itself back on and power everything back up.
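The arithmetic behind that mistake is worth spelling out. Assuming the shutdown software simply compares the UPS’s reported runtime with the configured threshold (an assumption about how that particular package behaved), a quick check would have caught the problem before Apply was clicked. The function name here is hypothetical.

```python
def shutdown_triggers_immediately(runtime_seconds, threshold_minutes):
    """True if the UPS already reports less runtime than the shutdown threshold."""
    return runtime_seconds < threshold_minutes * 60

# Runtime from the anecdote: 4 minutes, 38 seconds.
RUNTIME = 4 * 60 + 38  # 278 seconds

if __name__ == "__main__":
    for threshold in (2, 5):
        fires = shutdown_triggers_immediately(RUNTIME, threshold)
        print(f"{threshold}-minute threshold: immediate shutdown = {fires}")
```

With 278 seconds of runtime, a two-minute threshold (120 seconds) is safe, while a five-minute threshold (300 seconds) is already exceeded the moment it is saved.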

It wasn’t one of my finest moments, but it is a great example of how incredibly useful isolating and protecting your out-of-band management interfaces can be. If you don’t have this kind of capability now, there’s no time like the present to implement it.

This article, “Getting out of jail free,” originally appeared at InfoWorld.com.