No matter how much monitoring you do, false status readings can slip through -- you can't just take the data center's word for it.

It was one of those days. Around nine in the morning, I suddenly had to contend with a significant IT disaster: a failed UPS in a medium-size data center. The loss of all three phases kicked in the UPS, which held the load for all of six seconds before it quit. Poof! The whole data center went down.

Power was restored less than 20 seconds later, but the damage was done. Due to a variety of issues, I was then responsible for getting that data center back on its feet from 250 miles away. Because most of the servers ran Linux, the next hour was full of rapid keystrokes, IM communications, and a gallon of coffee.

When a data center goes down and comes back up without physical intervention, it doesn't come up nicely. Storage arrays initialize after the servers that try to mount their shares, some servers boot without access to DNS servers that are themselves still booting, and other problems cascade from there -- it's a mess.

Luckily, there were no data corruption issues, and eventually all servers and services were returned to a normal operating state. The next day consisted of trying to figure out why a massive UPS handling a 44 percent load decided to quit after just a few seconds, but that's what postmortems are for.

The battery monitor lied

The problem is that before dropping the load, the APC Symmetra 40k UPS showed that everything was fine -- except for a pesky failed self-test. I'd noticed the self-test failure before the outage, but there didn't seem to be anything wrong with the UPS, and the logs did not point to any reason for the failure.
All the monitored elements -- batteries, intelligence modules, power supplies, everything -- were green, and the management status page showed 100 percent battery capacity. I supposedly had 19 minutes of runtime at the current load. That 19 minutes turned into the aforementioned six seconds -- and my day was shot.

In IT we depend heavily on monitoring (or we should, if we're not already), so it's all the more painful when that monitoring lets us down. In this case, there was no legitimate warning to override the healthy appearance of the UPS. This is the risk we take in trusting all that monitoring, and it's a risk that cannot realistically be eliminated.

A manual task would have prevented the problem

The only thing that could have possibly prevented this outage is an old adage I use constantly: Fire it before it can quit. You can apply that saying to several common IT elements: hard drives, batteries, and some IT admins. In this case, the batteries may have been showing full capacity, but they were in fact three years old and should have been on a replacement schedule. An APC tech examined the UPS after the load-drop event and determined that even though the batteries appeared fine, when a no-load self-test was run, the unit's output dropped severely. Either the batteries were lying to the monitoring code in the UPS, or the monitoring code was lying to everyone else. Either way, the result was a day of chaos and lost work, time, and effort.

So even if you're monitoring everything under the sun and keeping tabs on the tiniest component in your infrastructure, take a moment to realize that sometimes the best thing you can do to reduce problems is to replace parts that seem to be working fine.

This article, "Your data center is lying to you," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.
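The "fire it before it can quit" policy is simple enough to automate. Below is a minimal sketch of an age-based replacement check; the inventory records, field names, and three-year threshold are illustrative assumptions, not any real UPS management API. The point is that the check deliberately ignores whatever capacity the hardware reports:

```python
from datetime import date

# Replace on a schedule, not on reported health. Three years is an
# assumed policy, not a vendor recommendation.
REPLACEMENT_AGE_YEARS = 3

# Hypothetical inventory records -- in practice this would come from
# an asset database, not a hardcoded list.
batteries = [
    {"id": "ups1-bat1", "installed": date(2008, 5, 1), "reported_capacity": 100},
    {"id": "ups1-bat2", "installed": date(2011, 2, 15), "reported_capacity": 100},
]

def due_for_replacement(battery, today=None):
    """Flag a battery by install age alone; reported capacity is
    intentionally ignored, since it can lie."""
    today = today or date.today()
    age_days = (today - battery["installed"]).days
    return age_days >= REPLACEMENT_AGE_YEARS * 365

for b in batteries:
    if due_for_replacement(b):
        print(f"{b['id']}: {b['reported_capacity']}% reported, but past the "
              f"{REPLACEMENT_AGE_YEARS}-year replacement age -- replace it")
```

Run from a weekly cron job, a report like this would have flagged the three-year-old batteries in the story months before the monitoring page's reassuring 100 percent reading had a chance to matter.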