Two real-life IT misadventures defy belief -- and offer food for thought on how the whole mess might have been avoided

IT can be hilarious. Every once in a while, a bizarre set of circumstances will produce a situation that you might not believe unless you were there to see it. The stories of these events are the stuff of IT folklore and get retold over beer and peanuts year after year. But just because they’re funny — in hindsight — doesn’t mean they lack good object lessons. In a recent meeting with a client, I was reminded of two stories I’ll never forget.

The doomed cabinet

Back when data centers were still dominated by mainframes and Intel-based servers were just starting to make inroads, a client was faced with a data center floor space problem. New systems were being added at an alarming rate, and a reorganization was needed to make room. This particular move required an oversized enclosed cabinet to be removed and its contents — primarily networking equipment — to be shifted to a smaller relay rack in a different part of the room.

Most of the equipment could be transported easily. The one real challenge was a single network switch, a Cisco Catalyst 2980, that essentially ran the entire data center. Throughout the 1,500-seat organization, any app that wasn’t delivered through a green screen depended at some point or another on this switch to get to server resources.

When faced with a request for the 20 or 30 minutes of downtime required to complete the frantic uncabling, unracking, reracking, and recabling, the organization’s upper management flatly refused. To them, even this relatively limited amount of downtime was too much for the organization’s 24/7 operations to suffer for a non-emergency event. Even after extensive haggling over exactly how many minutes it would take, approval for the downtime still couldn’t be obtained, nor did management want to pay for a second switch that would have allowed an orderly transition from one rack to the next.

After a great deal of deep thought and soul searching, the IT team hatched a plan. Like all truly great plans, it required the labor of three people with no fear, a reciprocating saw, and absolute secrecy.

Late one night, when nobody would be roaming the halls to hear the noise, the IT ninjas sprang into action. One detached the switch from the rack and held it over his head along with the mass of cables attached to it. Another, wielding a Super Sawzall, deftly hacked apart the cabinet from around the switch and cabling, while the third discarded the pieces of the cabinet as they were cut away. At the conclusion of the process, the switch was carefully balanced on a milk crate, and the cabling was rerouted in such a way that the switch could be mounted in the new relay rack.

In the end, nobody asked them how they did it — and nobody cared. As long as there was no downtime and operations continued, it wasn’t their concern.

This is a tale of pure ingenuity and chutzpah. But there are a few good takeaways, too. One lesson is that sometimes you just need downtime. Announced downtime is always better than unexpected downtime: Users can prepare themselves for the outage and deal with applications being unavailable. Sure, it might decrease productivity for a while, but that’s a far sight better than the outage happening without any warning. Imagine if the saw man hadn’t had such steady hands. I shudder to think what would have happened if the cables had been hacked off by mistake or the switch had been dropped.
Sourcing a replacement switch and reterminating 70-odd cables would have taken quite a bit more than 30 minutes.

Another lesson is that investing in redundant infrastructure components — in this case, a secondary core switch — often buys you more than failover capacity if one of the components fails unexpectedly. It also buys you the flexibility to deal with operational changes that would otherwise require downtime. That’s one reason server virtualization and associated features like VMware’s vMotion and Storage vMotion have become so popular: Taking a virtualization host out of service for upgrades or shuffling data across to a new SAN volume simply doesn’t require downtime anymore. Sometimes it’s hard to sell that value to management, which can’t see it in use every day, but it’s almost always money well spent.

The soggy switch

Years later, the same client had wisely taken these lessons to heart and invested in dual-redundant core switching. Yet that redundancy can’t save you from everything — especially not an HVAC technician with a blowtorch.

Without the IT folks’ knowledge, a new string of air conditioners was being installed in a different part of the building. Part of that installation required plumbing to be run along the ceiling of one of the network closets. Located within this closet was a Cisco Catalyst 4506 switch responsible for aggregating a few hundred desktops from some of the most critical areas of the campus.

Unbeknownst to the HVAC tech, the room was also wired with a heat sensor attached to the fire suppression system. Within minutes of the tech starting his work, his torch set off the heat sensor, which in turn triggered an 80-gallon-per-minute deluge of water from the high-volume industrial sprinkler head. As fate would have it, that sprinkler head was located almost directly above the poor Catalyst 4506.

IT was notified almost immediately — and reached the closet a minute or two after the fire department arrived. The first thought on everyone’s mind was to kill power to the equipment to prevent it from being damaged by the almost inevitable short circuit. Unfortunately, the company’s strict safety rules dictated that only the fire department could enter an area in which a fire suppression system had been activated. They watched helplessly as a nonstop wave of water poured down like a biblical flood.

Ten minutes later, someone apparently oblivious to the waterworks in front of them ran up to say that the network in the area had just gone down. Apparently, the switch had done its best as it was being drowned, but it eventually gave up the ghost and went offline.

Not long after that, the fire department gave the all-clear. The IT folks entered the room, where the damage was plain to see. Everything was soaked — cables, patch panels, you name it. One member of the IT team was about to call Cisco for a replacement switch, but another staffer suddenly remembered that the exact same model switch was sitting in a box on the loading dock. It had been purchased to replace a different switch elsewhere on campus and simply hadn’t been deployed yet.

Fortunately, the network admin had insisted that the patch cables be labeled with the ports they were supposed to be attached to. The process of swapping out the cabling, shoving the new switch in the rack, recabling, and restoring a saved switch configuration would take only a half-hour or so.
While one member of the team ran down to retrieve the new switch, another started yanking cabling out of the 200-odd network ports. When he was about halfway through, the unthinkable happened: The switch came back on. Nobody had remembered to shut the power off. As it turned out, the switch had gone offline not because it had shorted out, but because the water had slowed the exhaust fans to the point where the switch shut itself down to prevent heat damage. After the switch finished its boot cycle, the network actually came back up for the users who hadn’t yet had their cables unplugged.

After recovering from the shock, the team decided to press ahead with the replacement. Just because the switch was up now didn’t mean it would stay that way. Years after it was written off by the HVAC company’s insurance, that water-tested switch still sits under a workbench in the data center, right next to an identical spare that was purchased just in case that kind of disaster were ever to happen again.

There’s a raft of lessons you can learn from this one — for example, not letting plumbers work in your network closet. Less obvious, perhaps, is the prework that ended up saving their bacon. Maintenance tasks like labeling cables and maintaining up-to-date configuration file backups are often overlooked, but they will save your hide if you need to do an emergency rip/replace.

Another major tip to remember is that those fancy support contracts with 2- or 4-hour on-site replacement guarantees aren’t likely to help you in this kind of disaster. After all, Cisco engineers don’t design their switches to be submerged. If a spare hadn’t been on hand, days might have gone by before a replacement arrived. You simply can’t pick up a $50,000 switch at your local Wal-Mart. There’s no substitute for having on-hand spares you can really depend on. In fact, in larger deployments like this one, you can even save money: This company no longer pays for high-end support contracts on its access-layer equipment — it maintains spares and opts for far less expensive next-business-day contracts. Lower cost and quicker recovery from failure — what’s not to like?

Next time you trade IT war stories over a few beers, make a mental note of the lessons to be learned. It may be someone else’s experience, but it’s still the best teacher.

This article, “The greatest IT stories never told,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.