No matter how much monitoring you do, false status readings can slip through -- you can't just take the data center's word for it.

It was one of those days. Around nine in the morning, I suddenly had to contend with a significant IT disaster: a failed UPS in a medium-size data center. The loss of all three phases kicked in the UPS, which held the load for all of six seconds before it quit. Poof! The whole data center went down.

Power was restored less than 20 seconds later, but the damage was done. Due to a variety of issues, I was then responsible for getting that data center back on its feet from 250 miles away. Because most of the servers ran Linux, the next hour was full of rapid keystrokes, IM communications, and a gallon of coffee.

When a data center goes down and comes back up without physical intervention, it doesn't come up nicely. Storage arrays initialize after the servers that try to mount their shares, some servers boot without access to DNS servers that are themselves still booting, and other problems cascade from there -- it's a mess.

Luckily, there were no data corruption issues, and eventually all servers and services were returned to a normal operating state. The next day consisted of trying to figure out why a massive UPS handling a 44 percent load decided to quit after just a few seconds, but that's what postmortems are for.

The battery monitor lied

The problem is that before dropping the load, the APC Symmetra 40k UPS showed that everything was fine -- except for a pesky failed self-test. I'd noticed the self-test failure before the outage, but there didn't seem to be anything wrong with the UPS, and the logs did not point to any reason for the failure.
All the monitored elements -- batteries, intelligence modules, power supplies, everything -- were green, and the management status page showed 100 percent battery capacity. I supposedly had 19 minutes of runtime at the current load. That 19 minutes turned into the aforementioned six seconds -- and my day was shot.

In IT we depend heavily on monitoring (or we should, if we're not already), so it's all the more painful when that monitoring lets us down. In this case, there was no legitimate warning to override the healthy appearance of the UPS. This is the risk we take in trusting all that monitoring, and it's a risk that cannot realistically be eliminated.

A manual task would have prevented the problem

The only thing that could have possibly prevented this outage is an old adage I use constantly: Fire it before it can quit. You can apply that saying to several common IT elements: hard drives, batteries, and some IT admins. In this case, the batteries may have been showing full capacity, but they were in fact three years old and should have been on a replacement schedule. An APC tech examined the UPS after the load-drop event and determined that even though the batteries appeared fine, when a no-load self-test was run, the unit's output dropped severely. Either the batteries were lying to the monitoring code in the UPS, or the monitoring code was lying to everyone else. Either way, the result was a day of chaos and lost work, time, and effort.

So even if you're monitoring everything under the sun and keeping tabs on the tiniest component in your infrastructure, take a moment to realize that sometimes the best thing you can do to reduce problems is to replace parts that seem to be working fine.

This article, "Your data center is lying to you," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.
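The "fire it before it can quit" policy is simple enough to automate. Below is a minimal sketch of an age-based replacement check; the inventory records, field names, and three-year threshold are illustrative assumptions, not any real UPS management API. The point is that the check deliberately ignores whatever capacity the hardware reports:

```python
from datetime import date

# Replace on a schedule, not on reported health. Three years is an
# assumed policy, not a vendor recommendation.
REPLACEMENT_AGE_YEARS = 3

# Hypothetical inventory records -- in practice this would come from
# an asset database, not a hardcoded list.
batteries = [
    {"id": "ups1-bat1", "installed": date(2008, 5, 1), "reported_capacity": 100},
    {"id": "ups1-bat2", "installed": date(2011, 2, 15), "reported_capacity": 100},
]

def due_for_replacement(battery, today=None):
    """Flag a battery by install age alone; reported capacity is
    intentionally ignored, since it can lie."""
    today = today or date.today()
    age_days = (today - battery["installed"]).days
    return age_days >= REPLACEMENT_AGE_YEARS * 365

for b in batteries:
    if due_for_replacement(b):
        print(f"{b['id']}: {b['reported_capacity']}% reported, but past the "
              f"{REPLACEMENT_AGE_YEARS}-year replacement age -- replace it")
```

Run from a weekly cron job, a report like this would have flagged the three-year-old batteries in the story months before the monitoring page's reassuring 100 percent reading had a chance to matter.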