Mystery solved: The case of the very bad Monday

analysis
Oct 23, 2007 · 3 mins

Symptoms can be misleading. Sometimes the best diagnosis comes from examining the bigger picture.

July 1987: My first day on the National Support desk for Wang Labs in Canada, and I guess the rest of the team wanted to provide me with a trial by fire. I was assigned to work with a specific customer who had an intermittent problem: his departmental minicomputer would suddenly slow to a crawl every Monday morning.


Over the course of the next several weeks, we studied every aspect of what happened on Monday mornings. Folks came in, logged on, and checked their e-mail; certain users went into applications, others went for coffee. The rate of use was not heavy, and there was normally much more activity by 10:30 or midafternoon. Moreover, there was much more activity on Tuesday through Thursday, yet the machines experienced no slowdown on those days. All we could tell was that the system seemed to be suddenly doing a very large quantity of system-related disk I/O, and no one could move forward while this was occurring.

We continued looking for something that was different on Mondays compared to any other day of the week but never seemed to find it, until one day we learned that every Sunday the systems were subjected to a systemwide backup. Also, at that time the operators would reorganize all of the indexed files on the system and release all of the free space in those files in order to save disk resources.

This information suddenly clarified everything. The following day, Monday, users would slowly log in to various applications and begin performing data maintenance. As any of the larger files would have records written to them, they would eventually exceed the 2K data block or index block that remained after all the free space was cut from the files. At that point, the Wang OS would assign a new extent to the file in the form of free space, and would then proceed to link each of the blocks in the new extent. This new extent would be equal to half of the existing file size, and in many cases that could have been 100,000 blocks or more. These I/Os happened before anything else could be done on the system and drove the system into the ground, as file after file would go through the exact same process.
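The arithmetic of that failure mode can be sketched in a few lines. This is a simplified, hypothetical model of the behavior described above, not actual Wang OS internals; the function name and block counts are illustrative only:

```python
# Hypothetical sketch of the extent-growth behavior described above.
# Assumption: when a write exceeds a file's remaining free blocks, the OS
# allocates a new extent equal to half the file's current size and must
# link every block in that extent before the write completes.

def blocks_linked_on_append(file_blocks: int, free_blocks: int) -> int:
    """Return how many new blocks must be allocated and linked
    when one more record is appended to the file."""
    if free_blocks > 0:
        return 0  # room remains in the current extent; no linking I/O
    # No free space left (the post-backup state): grow by half the file.
    return file_blocks // 2

# After Sunday's maintenance stripped all free space, the first Monday
# write to a 200,000-block file forces 100,000 block-linking I/Os:
print(blocks_linked_on_append(200_000, free_blocks=0))  # 100000

# With even a little free space retained, the same write costs nothing extra:
print(blocks_linked_on_append(200_000, free_blocks=500))  # 0
```

The sketch makes the fix obvious: leaving free space in the files moves the cost from the first write of the week to a rare, amortized event.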

I had the customer start by not removing free space from the files and sure enough, everything worked like a charm. We then had them rotate file maintenance over the course of the week, and move to a file-space-management model in order to keep a reasonable amount of free space in all of their data files. The problem never returned.

I learned an important lesson from this experience. Although most support issues can be looked at in terms of “what changed?” it does not necessarily mean that the change was directly tied to the symptom. Sometimes you have to broaden your search and remember that if you only search where there is light, you run a good chance of missing something important.


Since 2005, IT pros have shared anonymous tech stories of blunders, blowhard bosses, users, tech challenges, and other memorable experiences. Send your story to offtherecord@infoworld.com, and if we publish it in the Off the Record blog we'll send you a $50 American Express gift card -- and, of course, keep you anonymous. (Note that by submitting a story to InfoWorld, you give InfoWorld Media Group, its affiliates, and licensees the right to republish this material in any medium in any language. You retain the copyright to your work and may also publish it without restriction.)
