Stand back and let IT do its job

analysis
Nov 28, 2012
5 mins

When the network goes down at a national power grid operator, a techie hits upon the solution -- and proves the CIO wrong

No admin looks forward to a network outage, but add the national power grid and a blame-happy CIO to the mix, and you may find yourself undergoing an impromptu test. Not only are your technical skills under examination, but also your ability to keep your cool as accusatory eyes watch your progress in resolving the crisis.

I was thrown into this very situation over a decade ago, when I worked for a network provider and was dispatched after a customer called about a large-scale disruption. The problem occurred at the head office of the national power grid operator in my country. The CIO there had a long-established track record of rapidly attributing blame elsewhere whenever things went wrong, so getting a call from him immediately put us on guard. I was not looking forward to dealing with him while fixing the mess.


The CIO met me when I arrived and explained the situation. He was very tense and said his IT staff had determined that some component of the network (that is, my company’s responsibility) must be causing the problems. He didn’t blame us outright, but his tone was accusatory. I mentally geared myself up for what was very probably going to be a long and stressful day.

The mood at the site was panicked to say the least. The office had about 500 employees, and by the time I arrived about half of them had been unable to work for almost two hours. The IT staff were investigating to see if other sites or the national power grid were affected. The CIO was under extreme pressure from other managers to resolve the problem — yesterday.

The corporate network used two ranges of IP addresses with a large core router providing the link between the two halves of the network. About half of the on-site PCs were intermittently losing access to servers in the other subnet.

The internal IT support personnel had been working on the issue but had hit a roadblock. They’d come to the conclusion that a major fault had arisen in the core router. They had started an investigation but found that the network breaks were so severe they couldn’t log on to most of the network equipment to begin troubleshooting.

The CIO was frantic and hovered over my shoulder through every troubleshooting step, demanding updates every minute. I eventually asked him to wait outside the room so that I could get on with the job. To his credit, he complied.

Further investigation showed that the hardware address of the router appeared to change for a few minutes at a time, then revert to the expected address. To my surprise, opening a terminal session to the router when the hardware address was “wrong” resulted in a log-on prompt from a well-known brand of print server.
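For readers who want to spot this symptom themselves, here is a minimal sketch in Python of the underlying check: watch which hardware (MAC) address answers for each IP address over time, and flag any IP that is claimed by more than one. This is not the tooling I used on the day. The function name `find_flapping_ips` and every address in the sample data are made up for illustration, and the sketch assumes you have already collected (IP, MAC) pairs by some means, such as sampling a PC's ARP table at intervals.

```python
from collections import defaultdict


def find_flapping_ips(observations):
    """Return {ip: [mac, ...]} for every IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) pairs, for example taken
    from repeated snapshots of a host's ARP table. MAC addresses are
    compared case-insensitively and listed in first-seen order.
    """
    seen = defaultdict(list)
    for ip, mac in observations:
        mac = mac.lower()
        if mac not in seen[ip]:
            seen[ip].append(mac)
    # Keep only the IPs that more than one hardware address has claimed.
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical samples: the gateway address alternates between two MACs.
samples = [
    ("10.0.0.1", "00:11:22:33:44:55"),   # assumed router MAC
    ("10.0.0.20", "00:AA:BB:CC:DD:01"),  # an unrelated host
    ("10.0.0.1", "00:99:88:77:66:55"),   # a second device answering
    ("10.0.0.1", "00:11:22:33:44:55"),   # back to the router
]

for ip, macs in find_flapping_ips(samples).items():
    print(f"{ip} answered from multiple hardware addresses: {', '.join(macs)}")
```

An IP address that shows up in the result is worth chasing: two devices are probably claiming it, and the unexpected MAC is the one to track down.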

This organization had many older printers that had been network-enabled by connecting them to print servers, and now I had to find the one causing problems. The CIO was hovering outside the door and looked puzzled when I explained to him what I was looking for.

We located the print server and discovered that an incorrect network configuration was loaded. We removed the print server from the network, then reconfigured and reconnected it. We checked to see if the connectivity problems were resolved: They were.

We pieced together what had happened. There had been a power brownout in the area earlier that day, which prompted the offending print server to restart. When it did, it loaded a stored network configuration that gave it the same network address as the corporate router. Then it was a matter of luck as to whether a PC tried to use the router or the print server to access servers on the other subnet.

The print server had operated without a hitch for more than three years running a correct configuration, but also storing a second, incorrect configuration. Its luck had run out.
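In hindsight, an audit that covered stored configurations as well as running ones might have caught this before the brownout did. The following is a rough sketch of such a check in Python, assuming an inventory that records every configuration profile held by each device. The inventory format, the device names, the profile labels, and the addresses are all hypothetical, not details from the actual site.

```python
from collections import defaultdict


def find_address_collisions(inventory):
    """Return {ip: [(device, profile), ...]} for IPs claimed by 2+ devices.

    `inventory` is an iterable of (device, profile, ip) tuples, where
    `profile` labels which configuration on the device holds the address
    (for example "active" or "stored").
    """
    claims = defaultdict(list)
    for device, profile, ip in inventory:
        claims[ip].append((device, profile))
    # An address is only a problem if distinct devices could end up using it.
    return {
        ip: owners
        for ip, owners in claims.items()
        if len({device for device, _ in owners}) > 1
    }


# Hypothetical inventory: one print server holds a stale stored profile.
inventory = [
    ("core-router", "active", "10.0.0.1"),
    ("print-server-7", "active", "10.0.0.57"),
    ("print-server-7", "stored", "10.0.0.1"),
    ("print-server-8", "active", "10.0.0.58"),
]

for ip, owners in find_address_collisions(inventory).items():
    claimed_by = ", ".join(f"{device} ({profile})" for device, profile in owners)
    print(f"{ip} is configured on more than one device: {claimed_by}")
```

The point of including stored profiles is that a dormant configuration is harmless only until the next restart.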

The systems controlling the national grid were never at risk, which was a relief to the CIO, and productivity at the office was restored.

The CIO was appreciative and thanked me for figuring out the problem. But what else could he say when the evidence of the problem ultimately came down to a human error by someone on his staff? I must admit, the end result of this situation was definitely satisfying to my company’s team.

This story, “Stand back and let IT do its job,” was originally published at InfoWorld.com.

infoworld_anonymous

