Wrapping everything up in the same box makes hard tasks easy and big problems bigger.

Try as we might to keep chaos at bay, there will come a time when the perfect storm hits and everything falls apart. Usually a confluence of elements triggers total meltdown, but sometimes one overlooked weak link fails and causes a cascade of problems that takes an entire network offline.

These situations are never easy to deal with, and they are generally compounded by the fact that admins are feverishly working to fix problems while being bombarded with alarms from other systems that are failing due to the initial outage. It's like trying to rebuild a house while it's falling down on top of you.

To make matters worse, depending on the nature of the problem, the tools needed for the fix may not be available. In some cases, that includes Internet access. I can recall a time when the network was down and the only Internet source was pre-iPhone cellphones. There was no cell signal in the data center and no wireless access anywhere, so someone stood outside Googling for answers and relaying information back down to a crew member with a laptop jacked into a console port.

Diabolical dependencies

I've seen admins try to boot a broken virtual server to a PXE rescue image to recover a corrupted disk caused by a bad NIC, only to realize after several minutes that the PXE services are provided by the server they're trying to fix. They then dig through the ISO share located on a nonproduction storage array for the boot image, but discover that nobody bothered to configure the array with a static IP and its DHCP lease just expired, because the broken VM was also the DHCP server. Utter mayhem.

It's during crises like these that some admins find religion and start promising they'll do things differently in the future, if only they could get this problem fixed right now. Lack of proper backups, lack of a backup plan, and lack of hardware support services make all of this much more challenging, but it's always harder to recognize that fact when the seas are calm. If the absence of a $100 replacement part is keeping an entire network offline, you may want to rethink your strategy and budget.

These days, we enjoy a significant concentration of technologies that makes day-to-day IT work much simpler than in the past. The era of corporate infrastructures built without virtualization is behind us, and we revel in the fact that we're running so many virtual servers on so few physical hosts. We can throw virtual machines around the infrastructure with abandon and completely renovate the underlying physical hosts without taking down a single production server. It's miraculous, right up until a cog in that highly condensed world snaps and takes down many more systems than if we had never virtualized at all.

Accidental Armageddon

As an example, in the days of old, if a storage array stopped talking for whatever reason, it affected only the few servers using it. A single application might go offline, but the remainder of the infrastructure hummed right along.

These days, even with all sorts of protections in place, if the wrong LUN is deleted or a storage array suffers a failure, it can conceivably take out hundreds of servers. Even assuming replication and snapshots, it will be a long while before those servers are back online, and it's highly likely that a significant portion of the infrastructure will remain offline with them.
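That blast radius is easy to measure before anything fails. Here's a minimal Python sketch, assuming a hypothetical CSV inventory export with "vm" and "datastore" columns (pulled from vCenter, a CMDB, or wherever your inventory lives), that simply tallies how many servers ride on each datastore:

```python
# blast_radius.py: a rough sketch, not a vendor tool. Assumes an inventory
# file "vm_inventory.csv" (hypothetical name) with columns "vm" and "datastore".
import csv
from collections import defaultdict

def blast_radius(inventory_path: str) -> dict[str, list[str]]:
    """Map each datastore to the VMs that would go down with it."""
    deps: dict[str, list[str]] = defaultdict(list)
    with open(inventory_path, newline="") as f:
        for row in csv.DictReader(f):
            deps[row["datastore"]].append(row["vm"])
    return deps

if __name__ == "__main__":
    deps = blast_radius("vm_inventory.csv")
    # Worst offenders first: the datastores whose loss hurts the most.
    for store, vms in sorted(deps.items(), key=lambda kv: -len(kv[1])):
        print(f"{store}: {len(vms)} VMs at risk")
```

The top of that list is the LUN worth worrying about, and it's exactly the kind of number that's invisible day to day and obvious the moment something breaks.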
We live our lives with the understanding that a single misstep, typo, or inadvertent click can cause big problems, but when we're working with systems that support massive sections of the infrastructure, those clicks get even riskier.

We may have multiple domain controllers, DHCP and DNS servers, and contingency plans galore, but none of that matters if they're all affected by the same problem because they share the same foundation, a failure mode worth auditing for, as sketched below. Once the clock starts ticking with those services offline, the problems get worse: as desktop systems, phones, and other devices start to lose their DHCP leases, the disaster grows. Scheduled jobs begin failing for one reason or another, then stack up or cause data corruption problems.

As we continue the trend of collecting all of our eggs in one basket, we need to understand the potential problems we're creating and hedge our bets accordingly. Yes, we can now build an entire corporate data center in software, including everything from the routers to the firewalls to the load balancers to the network itself. We can tightly couple all of these components to make management, upgrades, expansion, and production much faster, easier, and more cost-effective. Those are the unarguable reasons we're virtualizing everything we can.

While that foundation is stable, we reap the rewards. When it fails, we have more work to do than ever to repair it.
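One simple hedge is to verify that "redundant" services are actually independent. This sketch reuses the same hypothetical CSV inventory as above, extended with "role" and "host" columns, and flags same-role VMs (DNS, DHCP, domain controller) that share a physical host or datastore and would therefore fail together:

```python
# shared_fate.py: a rough sketch, assuming the hypothetical "vm_inventory.csv"
# from above with columns "vm", "role", "host", and "datastore".
import csv
from collections import defaultdict
from itertools import combinations

def shared_fate(inventory_path: str) -> list[str]:
    """Flag pairs of same-role VMs that would fail together."""
    vms_by_role: dict[str, list[dict]] = defaultdict(list)
    with open(inventory_path, newline="") as f:
        for row in csv.DictReader(f):
            vms_by_role[row["role"]].append(row)

    warnings = []
    for role, vms in vms_by_role.items():
        for a, b in combinations(vms, 2):
            for resource in ("host", "datastore"):
                if a[resource] == b[resource]:
                    warnings.append(
                        f"{role}: {a['vm']} and {b['vm']} share {resource} {a[resource]}"
                    )
    return warnings

if __name__ == "__main__":
    for warning in shared_fate("vm_inventory.csv"):
        print(warning)
```

Neither check will stop a meltdown, but both take minutes to run while the seas are calm, which is precisely when nobody thinks to run them.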