Matt Prigge
Contributing Editor

Monitor your storage — or else

analysis
Apr 12, 2010 · 6 mins

Long-term trending is critical to maintaining a surprise-free storage infrastructure

We all have a tendency to put off responsibilities that pay off over the long haul but seem to provide little immediate value. So it goes with long-term monitoring of storage, an activity essential to the success of any centralized primary storage environment. Monitoring is often forgotten until it’s too late to be useful. If you don’t want your storage architecture to become obsolete before its time, you need to implement monitoring now.

Why you need to monitor storage

A comprehensive storage monitoring strategy helps you catch problems before they’re problems. Whether it’s simple stuff, such as being able to forecast when you’ll need more disk, or the more complex task of determining whether an application slowdown is storage-related, monitoring is often the only way to answer these questions with any certainty. Without an early-warning system, sooner or later you will be faced with an unexpected capital investment or a prolonged troubleshooting adventure.

Back in November, I detailed a real-world example of this in “Scenes from a disaster: An upgrade gone too far.” In that instance, an enterprise was bitten by the increased performance demands of an upgraded version of a line-of-business application. No monitoring solution capable of trending performance was in place, and the new version went live without anyone evaluating whether the new release was going to push the storage harder than the previous version had. Long story short, it did — and how.

The only tried-and-true way of avoiding that problem and hundreds more like it is to have a monitoring framework that lets you see what your storage is doing, both right now and well into the past. How else can you expect to identify an unusual situation? Monitoring can also help you quickly establish that your storage system is not the cause of a problem you’re experiencing. The faster you can rule out part of your infrastructure as the cause of a problem, the faster you can move on to address the real cause.

What you need to monitor

So if monitoring is so important, what exactly do you want to see? Some data points are self-evident, some less so. The most obvious one is capacity. Having graphs of storage pool usage going back for years is incredibly helpful for forecasting when additional storage resources will be necessary.

But if the CFO asks you today whether you’ll need to make additional storage purchases next fiscal year, can you answer that question with any certainty? If not, you may not have the right information in front of you. Simply knowing how much storage is in use now won’t help — you really need to know how much was in use two years ago and how much it has grown since.
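That kind of forecast boils down to simple arithmetic on trending data. Here's a minimal sketch of the idea: the dates, usage figures, and pool size below are hypothetical stand-ins for what your own monitoring history would supply, and the straight-line growth assumption is deliberately crude.

```python
from datetime import date

# Hypothetical yearly capacity samples (date, TB used) of the kind a
# trending system would collect; substitute your own monitoring history.
samples = [
    (date(2008, 4, 1), 10.0),
    (date(2009, 4, 1), 14.0),
    (date(2010, 4, 1), 18.0),
]
pool_size_tb = 24.0  # hypothetical total pool capacity

# Estimate a simple linear growth rate from the first and last samples.
days = (samples[-1][0] - samples[0][0]).days
growth_per_day = (samples[-1][1] - samples[0][1]) / days

# Project how many days of headroom remain at the current rate.
headroom_tb = pool_size_tb - samples[-1][1]
days_until_full = headroom_tb / growth_per_day
print(f"~{days_until_full:.0f} days until the pool is full")
```

Real growth is rarely this linear, but even a rough projection like this is a far better answer for the CFO than a guess.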

Less well understood is performance monitoring. Anemic performance can cause problems as often as depleted capacity can, but performance monitoring is a much darker art. Modern storage architectures tend to be complex, and a great many data points must be collected to get a complete performance picture.

Often, people focus on how much data flows to and from the storage tier, as you would with a network. But that’s just one metric. Storage systems are far more limited by the number of transactions they can handle than by the amount of raw data they can push in one direction or another. If you monitor nothing else, make sure to monitor your storage system’s transactional latency. That number gives you a great one-look view of how well your storage is holding up under the load you’re throwing at it. Roughly speaking, transactional latency under 20 milliseconds during normal production load is good, and anything over that level may indicate that your storage is having a hard time keeping up.
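The rule of thumb above is easy to encode as an alert check. This is a minimal sketch, not a vendor tool: the sample readings are invented, and the 20 ms threshold is simply the guideline from the text, which you'd tune for your own workload.

```python
# Rule-of-thumb latency ceiling from the text; tune for your environment.
LATENCY_THRESHOLD_MS = 20.0

def latency_alerts(readings_ms, threshold=LATENCY_THRESHOLD_MS):
    """Return the latency samples that exceed the acceptable threshold."""
    return [r for r in readings_ms if r > threshold]

# Hypothetical per-interval latency samples, in milliseconds.
readings = [4.2, 7.9, 18.5, 31.0, 22.4, 9.1]
over = latency_alerts(readings)
print(f"{len(over)} of {len(readings)} samples over {LATENCY_THRESHOLD_MS} ms")
```

A real poller would feed this from the array's counters on a schedule, but the decision logic is no more complicated than this.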

How to monitor your storage

How you monitor your primary storage usually depends on what you have. Different SAN and NAS manufacturers provide vastly different monitoring capabilities and tools — and some don’t provide any. Some vendors provide dedicated, easy-to-install tools that monitor just about anything you’d need to know. Other vendors provide complex, hard-to-manage enterprise monitoring tools that lack flexibility.

A great example of a solid, vendor-provided monitoring tool is Dell EqualLogic’s SAN HQ monitoring suite. SAN HQ is a fairly simple, but exceptionally useful tool that allows you to monitor your EqualLogic Peer Storage array via its built-in SNMP capability. It captures all of the important information and makes it easy to review. Better still, since it uses SNMP, it’s fairly trivial to incorporate any of the information you see in that tool into any other SNMP-based monitoring suite you might already have.

If your storage platform doesn’t come with decent monitoring software, you may be able to get all of the information you’ll need through SNMP or something similar. Then the trick is figuring out how to record the information and store it for review later.
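The "record it and store it" step can be as modest as appending timestamped samples to a flat file that a graphing tool can consume later. Here's a minimal sketch under that assumption; the metric name and values are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

def record_sample(fileobj, ts, metric, value):
    """Append one timestamped sample to a CSV-backed metrics log."""
    csv.writer(fileobj).writerow([ts, metric, value])

log = io.StringIO()  # stands in for an open file on disk

# Hypothetical readings an SNMP poller might have gathered.
record_sample(log, 1271030400, "pool1.read_latency_ms", 12.4)
record_sample(log, 1271030700, "pool1.read_latency_ms", 19.8)

# Later review: read the samples back for graphing or analysis.
log.seek(0)
rows = list(csv.reader(log))
```

Purpose-built round-robin databases such as RRDtool handle retention and consolidation for you, but the core task is exactly this: stamp each reading with a time and keep it somewhere durable.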

Better living through monitoring storage yourself

Although it will take some effort, the best way both to learn a lot about your storage platform and to ensure that you are monitoring everything that is important to you is to do it yourself. Even the best, most expensive monitoring platform may not be able to monitor all of the devices that have some influence on your storage architecture without some additional effort on your part — whether that’s monitoring the storage switches or the SAN itself.

If you have some experience working with SNMP monitoring, I’d strongly recommend checking out Cacti, an open source, PHP-based monitoring framework that uses RRDTool databases and graphs to maintain long-term trending data. As with any free solution, it has some rough edges, but it’s infinitely malleable. Whether you’re talking about network throughput, CPU usage, capacity usage, or disk latency, if you can get the number, you can graph it in just about any way imaginable.

Although Cacti takes some tinkering to get up and running, the payoff can be huge. Cacti has a vibrant community of contributors who have written easy-to-import device templates and plug-ins. A great example is Howard Jones’ Network Weathermap, which can run as a Cacti plug-in and turns Cacti-obtained data into immensely useful current-state flow diagrams. For example, you can graph the bandwidth usage of a dual-fabric Fibre Channel SAN, chart latency, and monitor transactions per second.

The monitoring imperative

No matter how you do it, make sure you have a good monitoring system in place for your storage architecture. Even if you don’t study the output every day, you’ll find monitoring indispensable when diagnosing unexpected performance problems. Moreover, you’ll be fully armed with hard information when you need to secure future capital investments — and do it in a sane, planned way. Remember, historical data gives monitoring systems their power, so the sooner you install one, the better. Critical performance data you fail to monitor right now is gone forever.

This story, “Monitor your storage — or else,” was originally published at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com.