Paul Venezia
Senior Contributing Editor

The true grit of IT troubleshooting

analysis
Feb 19, 2013 · 5 mins

Tackling big problems as they emerge takes nerves, luck, and a gambler's instinct. Armchair quarterbacks need not apply

Troubleshooting brings out the armchair quarterback in most IT pros, especially during emergent and highly visible outages. After all, we love a tough problem, and it’s easy to assess a situation from a detached position and say you could have solved it better, faster, or with less fallout.

We even do it to ourselves: How many times have we said that if we only knew then what we know now, we would have done things differently? But there is no detached position when troubleshooting. We only know what we know in that moment, so we forge ahead with our best guess, balancing the risks of our actions against the promise of a resolution. For the most part, our gambles succeed; otherwise we wouldn’t be in the position to do much troubleshooting again.


Dealing with emergent problems means making a devil’s choice between a fix and forensics. Speed up the time required to fix a particularly gnarly problem, and you might end up destroying data that could reveal the problem’s actual cause during a postmortem. Seasoned IT pros try to have their cake and eat it too. But in the end, returning to normal operation trumps collecting data for forensics, so it’s not always possible to retain as much data as we might like.

Unfortunately, this often means reliving the issue later.

The true test of IT troubleshooting

Those who have never dealt with a true IT emergency can’t understand what it is like, largely because it’s almost impossible to describe the situation accurately. Phrases like “firefighting” or “life-saving” are bombastic to the layperson, who generally thinks of computer problems as being solved by unplugging a DSL router.

But when a blocking problem appears out of nowhere, especially during what might otherwise have been a routine procedure or when no work was under way at all, the same basic response takes hold of those of us who have to fix it. Every second that passes is a further indictment of the IT operation as a whole. There are no breaks, no timeouts; there’s just a problem looming over all else and a brain or two spinning like centrifuges, looking for a way over, under, around, or through the problem while causing as little collateral damage as possible.

A better analogy might be that dealing with a major IT problem is like being locked in a room without food or water — all other thoughts besides getting out of that room evaporate. There is and must be a singular focus on finding a key and working yourself free of the confinement.

Networking issues are generally more forgiving than those involving servers or storage. Usually it means a circuit is down, routes have been mangled, or a switch has failed. The outages can cause big problems, but once the source is determined and the issue is fixed, everything pops back to normal quite quickly, with little chance of further damage. That’s not to say that network troubleshooting is any easier — it certainly is not. But it’s easier to play fast and loose when troubleshooting the network because the effects of your efforts are almost always reflected immediately.

That’s not the case with servers and storage, where most troubleshooting decisions require intense deliberation. That’s because choosing the blue pill over the red pill might fix the issue at hand, but might also mean several hours of work restoring data from backups or restoring physical or virtual servers in full. Furthermore, the effects of your efforts are not necessarily immediately visible. It may take hours to know if issuing a certain command on a SAN will elicit the desired result, or if standing up a snapshot and working on that will get a critical component back online without too much data loss. Simply defining “too much data loss” is a gambler’s game.

As time wears on during an emergency, we become more cavalier in our testing and decisions. Initially we may have been reluctant to reboot a virtualization host that has lost all contact with the cluster yet is still hosting functional VMs. But after an hour or two of picking at why it lost its mind, we may finally accept there is no other choice. We then go into unplanned downtime, power down as many of those VMs as we can, cut our losses, and pull the trigger.

Fixing more than just symptoms

As I mentioned last week, many of the most intractable, brain-bending problems are the result of bugs somewhere within infrastructure components that we cannot see or even diagnose without extensive and time-consuming forensics work, and usually a high-tier engineer from the vendor. When you have hundreds or thousands of people sitting idle while you poke through a critical system, those are luxuries that are expensive indeed. Once a prospective fix is determined and appears to solve the problem, it’s usually implemented immediately, and fingers are crossed in hopes that it holds and the problem does not occur again.

Those are exactly the kinds of problems that are likely to occur again, because they were never actually fixed; only the symptoms were addressed. If there’s time and budget for a lab, these issues can sometimes be recreated under less damaging conditions, and the source of the problem can be uncovered and fixed. Otherwise, all you can do is add it to the long list of big problems that happen once or twice and that you hope never happen again.

Such is life in IT.

This story, “The true grit of IT troubleshooting,” was originally published at InfoWorld.com. Read more of Paul Venezia’s The Deep End blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.