Enterprise-class storage systems are incredibly fault tolerant, but even the most reliable systems can fail in the hands of users.

By definition, enterprise-class primary storage is rock solid. It’d better be — without bulletproof storage, the rest of the application infrastructure built on top of it can’t hope to be reliable. Applications have enough problems of their own without flaky storage making matters worse. That’s why enterprises spend huge portions of their IT budgets to buy the best, most reliable storage infrastructure they can afford. Redundant disks, redundant controllers, mirrored cache, and redundant storage networking fabrics go a long way toward delivering the kind of fault-tolerant storage infrastructure we’ve come to expect in mission-critical environments.

But even the most highly redundant storage architecture out there can’t protect itself from one threat: you. Lest you be offended, “you” includes me, too. Of the huge number of enterprise storage devices I’ve laid hands on over the years, only one ever crashed catastrophically due to hardware failure. On the other hand, I’ve lost count of the outages I’ve seen caused by bad documentation, bad tech support advice, insufficient training, and software or firmware that somehow overlooked the fact that it might be used by actual human beings one day.

As if to underscore this phenomenon, I’ve witnessed two primary storage environments crumble to the ground within the past month. Cue the ominous music. In the enterprise storage infrastructure, users are supported by two separate yet equally important groups: the devices that store their data and the people who manage them. These are their stories.

A firmware revision too far

In the first case, a new storage device was being introduced to an existing storage environment. The storage architecture in question unites multiple storage devices into groups that form a seamless storage environment. The goal of the implementation was to replace an existing storage system with a newer one; the existing system would then move to a different site.

Because the new system was built on newer hardware, it required and shipped with newer firmware than was running on the existing system. Upgrading the existing system to match the new one was certainly possible, but it required a maintenance window that would have been difficult to schedule. A check of the documentation seemed to indicate that the two versions in question could coexist for the time it would take to migrate the data from one device to the other, so work proceeded with the two devices running slightly different firmware.

At first, things progressed nicely. Test data was placed onto the new device, and it was thoroughly burned in. Performance test metrics exceeded expectations, and no problems were observed. Following that, a few less critical data volumes were hot-migrated; that, too, went well. The decision was made to move forward with the migration of all production data, which, given the large amount of information in question, took several days.

Not more than 30 minutes after the migration completed, the new storage device — now containing all of the organization’s production data — promptly fell out of the storage fabric and became completely inaccessible. Fortunately, this took place on a Saturday morning when few users would notice, but it brought down the entire virtualized server infrastructure that depended upon it.

After the new device was power-cycled, it came back up on the fabric with no further coaxing. It took several hours of concerted effort to get all of the systems up and running again, but there was no loss or corruption of data.

The next task was to figure out what had just happened. Was the new device unreliable? Was there a hardware or implementation problem that hadn’t been detected during testing? After two weeks and a multitude of emails and phone calls with the manufacturer’s tech support, it turned out that the failure was indeed caused by the firmware mismatch between the two devices. Eventually, the manufacturer was able to point to a piece of documentation indicating that running mismatched firmware wasn’t recommended for any longer than it took to upgrade all of the devices to the same version (which, in the manufacturer’s mind, turned out to be an hour or so). The wording of the documentation was extremely loose — leaving the acceptable length of time to the reader’s imagination, with nothing firmer than the word “consider.” No doubt the funky manual also played a large part in the manufacturer taking so long to determine the problem.
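Had the documentation been clearer, a simple pre-flight check could have blocked the migration until every device in the group ran identical firmware. Here is a minimal sketch of that idea in Python; the array names and the version inventory are illustrative assumptions, since real version strings would come from the vendor's own CLI or API rather than a hard-coded table.

```python
#!/usr/bin/env python3
"""Hypothetical pre-flight check: refuse to begin a migration unless every
array in the group reports the same firmware version."""

# Illustrative inventory mapping array name -> reported firmware version.
# In practice, these values would be pulled from the vendor's CLI or API.
ARRAY_FIRMWARE = {
    "array-existing": "6.4.1",
    "array-new": "6.5.0",
}

def firmware_is_uniform(inventory):
    """Return True only if every array reports the identical firmware build."""
    return len(set(inventory.values())) == 1

if __name__ == "__main__":
    if firmware_is_uniform(ARRAY_FIRMWARE):
        print("Firmware matches across the group; safe to schedule the migration.")
    else:
        for name, version in sorted(ARRAY_FIRMWARE.items()):
            print(f"  {name}: {version}")
        raise SystemExit("Firmware mismatch: upgrade everything to one version first.")
```

A check this dumb would have forced the scheduling conversation about the upgrade window before any production data moved, instead of after the outage.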
The unlucky onsite technician

In the second case, a production storage array started throwing transient errors that seemed to indicate a problem with one of its back-end Fibre Channel loops. Because all systems of this class use multiple, redundant loops for connectivity between the controllers and the disks, there was no immediate cause for concern. But the errors did warrant additional investigation and corrective action.

The array was covered under a high-end support contract, so the manufacturer quickly dispatched an onsite technician to determine the cause of the errors. He arrived and immediately got to work digging through logs and inspecting the hardware for simple problems like bad cables or transceivers. None of his initial troubleshooting yielded results, so he dug deeper and started dumping raw performance data directly from the controllers. On this particular platform, that’s done by connecting a serial cable to a maintenance port and issuing specific commands in the management software. The goal is to get the controller to cough up extremely detailed error information that can later be decoded by higher-level engineers.

That serial port also turns out to be a direct interface to the controller’s software kernel, where typing the wrong characters can have some fairly serious consequences. In this case, it turns out that hitting Ctrl-Z on your terminal emulator while connected to the serial port sets in motion a chain of events that causes both redundant controllers to reboot simultaneously — a process that unceremoniously drops all inbound storage connectivity. If you look at your keyboard, you’ll see that the Ctrl and Z keys are not that far apart. If the poor technician didn’t know that beforehand, he isn’t likely to forget it now.

As in the first case, the actual outage lasted only a short time and destroyed no data. Even so, it took a couple of hours to bring the various applications back up.
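One way to defend against that kind of fat-fingered accident is to put a guard between the keyboard and the maintenance port, one that drops dangerous control characters before they hit the wire. The sketch below is purely hypothetical: the premise that Ctrl-Z (byte 0x1A) triggers the dual-controller reboot comes from the story above, and the whitelist of safe control bytes is my assumption, not a vendor recommendation.

```python
#!/usr/bin/env python3
"""Hypothetical keystroke scrubber for a maintenance-port session: filter
out dangerous control characters before they ever reach the array."""

# Control bytes allowed through: tab, line feed, carriage return.
# This whitelist is an illustrative policy, not a vendor recommendation.
SAFE_CONTROL_BYTES = {0x09, 0x0A, 0x0D}

def scrub(keystrokes: bytes) -> bytes:
    """Drop any control byte (0x00-0x1F, 0x7F) not on the safe list,
    so a stray Ctrl-Z (0x1A) never reaches the controller."""
    return bytes(
        b for b in keystrokes
        if (b >= 0x20 and b != 0x7F) or b in SAFE_CONTROL_BYTES
    )

if __name__ == "__main__":
    typed = b"show perfdata\x1a\r"  # a command with an accidental Ctrl-Z
    print(scrub(typed))  # b'show perfdata\r' -- the Ctrl-Z is gone
```

A filter like this could sit in front of whatever call actually writes to the serial line (pyserial's Serial.write(), for example), so only scrubbed bytes ever reach the maintenance port.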
The moral of the story

In hindsight, it’s easy to say that mismatching firmware on anything — regardless of what the docs do or don’t say — is asking for trouble. It’s also easy to say you shouldn’t put your coffee mug down on your laptop when you’re jacked into the storage brain of a 1,500-seat enterprise network. Both pieces of advice are indisputably right on the money, but they miss the bigger picture.

The chances that a fully redundant, enterprise-class storage network is going to die all on its own are extremely slim. Yes, it happens, but it’s a rarity. If your storage is going to fail, the cause is much more likely to be human error: you, me, someone in your department, or a vendor you bring on site will suffer a simple lapse in judgment that has far-reaching consequences.

At a time when storage vendors are striving to make their systems easier to manage by presenting attractive, user-friendly management interfaces, it’s easy to forget that you’re fiddling with increasingly complex software and hardware spaghetti that shoulders the weight of your entire IT infrastructure. Never forget that you’re an errant mouse click, a misplaced coffee mug, or a “sure, that’ll probably work” away from a terrible day at the office. Regardless of how fancy they may look or feel, the mission-critical IT systems of today deserve the same fear and respect as did the systems of our IT forefathers.

This article, “A tale of two SAN failures,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com.