Mystery solved: The case of the very bad Monday

analysis
Oct 23, 2007 · 3 mins

Symptoms can be misleading. Sometimes the best diagnosis comes from examining the bigger picture.

July 1987: My first day on the National Support desk for Wang Labs in Canada, and I guess the rest of the team wanted to provide me with a trial by fire. I was assigned to work with a specific customer who had an intermittent problem: his departmental minicomputer would suddenly slow to a crawl every Monday morning.


Over the course of the next several weeks, we studied every aspect of what happened on Monday mornings. Folks came in, logged on, and checked their e-mail; certain users went into applications, others went for coffee. The rate of use was not heavy, and there was normally much more activity by 10:30 or midafternoon. Moreover, there was much more activity on Tuesday through Thursday, yet the machines experienced no slowdown on those days. All we could tell was that the system seemed to be suddenly doing a very large quantity of system-related disk I/O, and no one could move forward while this was occurring.

We continued looking for something that was different on Mondays compared to any other day of the week but never seemed to find it, until one day we learned that every Sunday the systems were subjected to a systemwide backup. Also, at that time the operators would reorganize all of the indexed files on the system and release all of the free space in those files in order to save disk resources.

This information suddenly clarified everything. The following day, Monday, users would slowly log in to various applications and begin performing data maintenance. As any of the larger files would have records written to them, they would eventually exceed the 2K data block or index block that remained after all the free space was cut from the files. At that point, the Wang OS would assign a new extent to the file in the form of free space, and would then proceed to link each of the blocks in the new extent. This new extent would be equal to half of the existing file size, and in many cases that could have been 100,000 blocks or more. These I/Os happened before anything else could be done on the system and drove the system into the ground, as file after file would go through the exact same process.
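The arithmetic of that failure mode can be sketched in a few lines. This is a simplified, hypothetical model of the behavior described above, not actual Wang OS internals; the function name and block counts are illustrative only:

```python
# Hypothetical sketch of the extent-growth behavior described above.
# Assumption: when a write exceeds a file's remaining free blocks, the OS
# allocates a new extent equal to half the file's current size and must
# link every block in that extent before the write completes.

def blocks_linked_on_append(file_blocks: int, free_blocks: int) -> int:
    """Return how many new blocks must be allocated and linked
    when one more record is appended to the file."""
    if free_blocks > 0:
        return 0  # room remains in the current extent; no linking I/O
    # No free space left (the post-backup state): grow by half the file.
    return file_blocks // 2

# After Sunday's maintenance stripped all free space, the first Monday
# write to a 200,000-block file forces 100,000 block-linking I/Os:
print(blocks_linked_on_append(200_000, free_blocks=0))  # 100000

# With even a little free space retained, the same write costs nothing extra:
print(blocks_linked_on_append(200_000, free_blocks=500))  # 0
```

The sketch makes the fix obvious: leaving free space in the files moves the cost from the first write of the week to a rare, amortized event.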

I had the customer start by not removing free space from the files and sure enough, everything worked like a charm. We then had them rotate file maintenance over the course of the week, and move to a file-space-management model in order to keep a reasonable amount of free space in all of their data files. The problem never returned.

I learned an important lesson from this experience. Although most support issues can be looked at in terms of “what changed?” it does not necessarily mean that the change was directly tied to the symptom. Sometimes you have to broaden your search and remember that if you only search where there is light, you run a good chance of missing something important.


Since 2005, IT pros have shared anonymous tech stories of blunders, blowhard bosses, users, tech challenges, and other memorable experiences. Send your story to offtherecord@infoworld.com, and if we publish it in the Off the Record blog we'll send you a $50 American Express gift card -- and, of course, keep you anonymous. (Note that by submitting a story to InfoWorld, you give InfoWorld Media Group, its affiliates, and licensees the right to republish this material in any medium in any language. You retain the copyright to your work and may also publish it without restriction.)
