Kristian Kielhofner takes tech troubleshooting to a whole new level in solving the 'packets of death' issue

I figured I’d take a break from being assailed from all sides regarding the state of Perl (though I did find this very interesting, and especially this comment). No, this week I decided to go in a vastly different direction.

Every so often, I like to highlight curious or extreme examples of troubleshooting. Besides being good reads, much like a detective novel, these write-ups lodge in little crevices of our minds, dormant until a similar situation crops up weeks, months, or years later. What makes a good troubleshooter is the ability to eliminate possible causes until the actual source of the issue is revealed, and equally, the ability to instantly call upon tiny flecks of information from the distant past, then apply that knowledge to the current situation.

So when I happened across Kristian Kielhofner’s Packets of Death post, I figured it deserved recognition.

This is troubleshooting at a level that most people never reach. When we’re investigating issues of a similar nature, such as intermittent network connectivity problems, we step through the usual suspects, and 999 times out of 1,000, we replace a patch cord, update a driver, or do something equally pedestrian.
But when you encounter an issue like the one Kristian faced, with multiple reports of problems within brand-new hardware platforms, completely different infrastructures, completely different clients, and behavior that appeared to be OS-independent, there are no usual suspects.

The best part of this write-up is that it’s not only a very peculiar real-world case that keeps you guessing, but Kristian also details the tools he used and shows examples of how he arrived at the final answer. For those just entering the fray as network engineers, or even those who want to add to their skill set, reading through his methodology and playing around with the tools referenced will only serve to enhance those skills. If you’re at all interested in networks, being on a first-name basis with everything from Wireshark to tcpreplay to Ostinato will make fixing many problems far easier. This jaunt into the wild also serves to underline a little-recognized phenomenon: exactly how much we trust the lowest-level code in our infrastructures.

The fundamental problem that Kristian faced was bad code in an Intel network controller, a bug that would shut down the interface if certain conditions were met, conditions that occurred intermittently on a general-purpose network. Across the millions upon millions of Intel-driven NICs out in the world, a bug like this is so rare that the controller is absolutely the last place you’d look for the problem, as Kristian found out. You certainly don’t look at a NIC EEPROM as the first step in deducing the cause of a network issue.

For the most part, our blind trust in those devices is well placed. These situations are rare, and given the sheer quantity of similar parts in our infrastructures, their reliability and stability are quite impressive. We work with operating systems of various flavors across our servers, hypervisors, switches, routers, and firewalls. We push bits around at a high level to force the lowest-level code to do our bidding.
When you muck about with the configuration of a core switch, you’re manipulating the levers that lead to the code running on those ASICs. If that baseline code is not tight and bug-free, problems occur and reality begins to warp. That, right there, is the crux of the issue. We’ve all come across problems that appear to defy physics and threaten our very understanding of network construction and behavior. Problems that evade the pathways of our fundamental understanding of how things work are the worst of all. They might be as challenging as the one Kristian faced and require days or weeks of sleuthing to uncover, or they may be as simple and maddening as newly introduced licensing restrictions that go unnoticed, causing a network device to behave in bizarre, yet purposefully crafted ways. (If you ever run into a problem where you have intermittent host connectivity through a firewall that follows no discernible pattern, yet seemingly resolves after hours when you have time to work on it, do yourself a favor and check whether the firewall has internal host count restrictions.)

When all is said and done, it pays to remember that if you’re facing a problem that defies all known laws of networking, you might be looking at a problem that lies in a place you can’t access or repair. All you can do is work to identify the culprit and hold the manufacturer’s feet to the fire until you receive a fix.

Sherlock Holmes said it best: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

This story, “‘Packets of death’ reveal road to enlightenment,” was originally published at InfoWorld.com. Read more of Paul Venezia’s The Deep End blog at InfoWorld.com.