Matt Prigge
Contributing Editor

Scenes from a disaster: An upgrade gone too far

analysis
Nov 23, 2009

It's easy to miscalculate how a major upgrade will fare -- especially when you forget the importance of infrastructure


Not long ago in a land not far away, a manufacturing company was poised to perform a major version upgrade of its most mission-critical application. The company relied on that application to manage the very core of its business. Without it, business could not be done. No orders could be shipped. The company wouldn’t know a loyal customer from a stranger.

So the company invested serious time and effort into testing the new version. Users had been cycling through training rooms full of test workstations for months before the go-live date. The application vendor, which was on hand to witness the first large-scale deployment of the new version and fix bugs where necessary, had been an excellent partner throughout the process. The new functionality had been tested thoroughly, and the user base was very excited about it.

The go-live planning was fairly complicated. The back end of the application was running on a database cluster using some of the best servers money could buy. The database itself lived on a highly redundant storage array that had been implemented during the last application upgrade a few years earlier.

The new application version would see significant structural changes to the way the data was stored in the database, which entailed a complicated data conversion process that had to be performed offline. Fortunately, this process had been tested several times as production mirrors were restored and converted from backups for the test environment. In addition, several libraries on the workstations had to be upgraded. Unfortunately, once those library upgrades were complete, it would be impossible to use the old client version.

To mitigate the failback risk, the migration had been planned down to the smallest detail. It was decided that the system would be brought down after second shift on a Saturday evening. Database backups would be made and the conversion process would begin. Perhaps eight to twelve hours later, testing would be performed by a group of power users. If everything looked good at that point, policies would be pushed to the network to initiate the client upgrades. Soon after, Sunday second-shift employees would be able to get into the new live system and the migration would be complete.

Finally, it was go-live time. Like clockwork, the system came down at 7 p.m. on Saturday and the conversion process began. Early the following morning, the migration team had completed its work. The database looked like it was ready to go and upgraded application servers had been deployed. User testing kicked into gear again and everything looked excellent. Then came the order to push out the new client software — and roughly 1,000 workstations started to receive the client updates.

The first sign that something was amiss came in the form of a help desk call from a power user on the shop floor. He was trying to update an order he was working on, but it was taking a long time to get from one screen to the next. It wasn’t a big deal; it just wasn’t anywhere near as snappy as he remembered it being in training. At this point only about 150 users had received the upgrade and logged in to the system. Members of the application team looked at each other with a growing sense of dread. It was as if the temperature in the datacenter had suddenly dropped 20 degrees.

Within the next two hours, it became clear that something had gone horribly, horribly wrong. As more users piled into the system, it became slower and slower. The first thought was that some kind of network problem had arisen, so the network administrator started digging into that. At the same time, server admins started drilling into performance logs on the application servers and desktop techs were sent out to spot-check workstations. Investigations of the network didn’t turn up anything. The application servers and client workstations were also running as expected. No one thought the problem could possibly have anything to do with the database servers; they were fantastically oversized for what they were being asked to do. But it looked more and more like that was the only place left to look.

Six hours after the new version had gone live, the senior server administrator started sifting through the logs on the database servers. CPU and memory usage were significantly higher than they had ever been before the upgrade, but well within the capabilities of the overengineered server. Finally, she looked at the performance logs for the disk array — and it hit her like a ton of bricks. The disk array utilization was unbelievably high, so high that the array simply could not keep up with the load. Further analysis revealed that a set of complex stored procedures, all associated with the much-anticipated new functionality, were the culprits.
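
In hindsight, the warning signs were sitting in the disk counters the whole time. Below is a minimal sketch of the kind of automated check that would have surfaced the bottleneck in minutes rather than six hours, assuming the database servers' disk counters were exported to a CSV file; the file name, column names, and thresholds are hypothetical and would have to be tuned to the actual array.

```python
import csv

# Hypothetical export of the database servers' disk counters, one row per
# sampling interval (file name and column names are assumptions).
COUNTER_FILE = "db_disk_counters.csv"

# Rule-of-thumb saturation thresholds (assumptions; tune for the real array).
MAX_LATENCY_MS = 20.0    # sustained ms per transfer above this means the array is struggling
MAX_QUEUE_LENGTH = 2.0   # sustained average disk queue length per volume

def find_saturated_intervals(path):
    """Return (timestamp, latency_ms, queue_length) for intervals where the array looks saturated."""
    saturated = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            latency_ms = float(row["avg_disk_sec_per_transfer"]) * 1000.0
            queue_length = float(row["avg_disk_queue_length"])
            if latency_ms > MAX_LATENCY_MS or queue_length > MAX_QUEUE_LENGTH:
                saturated.append((row["timestamp"], latency_ms, queue_length))
    return saturated

if __name__ == "__main__":
    for ts, latency_ms, queue_length in find_saturated_intervals(COUNTER_FILE):
        print(f"{ts}: {latency_ms:.1f} ms/transfer, queue length {queue_length:.1f}")
```

Even a crude threshold check like this, run against counters collected during user testing, would have pointed straight at the storage array instead of the network, the application servers, and the workstations.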

An emergency meeting was called nine hours after launch. Terrible performance notwithstanding, an entire shift’s worth of production data had already been entered into the new version. Moreover, a plan had never been devised for downgrading the client application libraries to the old version. Failback was simply not an option. A slow application was better than shutting the plant down for the two or three shifts it would take to manually revert all of the clients to the old version.

The next week was the darkest period in the history of the company’s IT department. Production schedules were severely affected, customers were unhappy, and untold productivity was lost. The political fallout was massive. In the end, an expensive new high-performance storage array was implemented and production was able to continue on the new version.

*             *             *

The military likes to refer to this kind of situation as a “teachable moment.” In this disaster, there may be seven or eight things to learn. If your first reaction is that you should never perform an upgrade without an easily executable failback — or never allow yourself to be a software vendor’s guinea pig — you’re absolutely right. But there’s also a more specific lesson.

The root cause was the failure to gauge the impact of the new software on the performance of the server and storage architecture. If comprehensive load monitoring had been executed during the testing and training phases and extrapolated to simulate the load that a user base 40 times larger would exert, it would have become immediately apparent that the storage architecture would have to be modified to support the load. Never expect a software vendor to do this for you. They don’t know your environment like you do, and ultimately you’re the one whose job is on the line if it’s not done correctly.
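
By way of illustration, the extrapolation itself is nothing more than arithmetic. The sketch below scales the per-user I/O load observed in a small test environment up to the full production population and compares it against what the existing array can sustain; every figure is an illustrative assumption, not a number from this story.

```python
# Back-of-the-envelope load extrapolation of the kind the team never did.
# Every figure below is an illustrative assumption, not a measurement from the story.

test_users = 25            # users active in the training/test environment
test_peak_iops = 900       # peak IOPS observed on the test database storage
production_users = 1000    # roughly 40 times the test population
array_rated_iops = 8000    # sustained IOPS the existing array can deliver (assumed)

# Naive linear scaling: per-user load multiplied by the production user count.
iops_per_user = test_peak_iops / test_users
projected_iops = iops_per_user * production_users
headroom = array_rated_iops / projected_iops

print(f"Projected peak load: {projected_iops:,.0f} IOPS")
print(f"Array rating: {array_rated_iops:,} IOPS (headroom {headroom:.2f}x)")

if headroom < 1.5:  # leave margin for bursts and growth
    print("The storage architecture will not support the new version at full load; redesign it before go-live.")
```

Linear scaling is crude, since real workloads rarely scale perfectly with user count, but even this rough estimate would have shown the projected load exceeding the array's capability by a wide margin, months before go-live.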