Production system troubleshooting 101: it’s not always about technical knowledge

opinion

Jan 17, 2018 | 5 mins

Sometimes, the ability to suspend assumptions and ego is far more critical than specific technical knowledge for solving issues in production

One of the biggest misconceptions about troubleshooting systems is that it requires deep, specific technical knowledge to locate and solve production issues. This assumption can often extend the time between the discovery and resolution of a problem. At first this may seem counterintuitive, so let’s look at some common scenarios to see which concept makes the most sense.

To start with, most assumptions about broad concepts are generally wrong because they are based on the expectation that there is a single, best way of doing things every time. There are certainly times when the developer of a particular solution can look at a problem a production application is having and instantly say, “I know why that is happening.” This happens not because the developer deliberately left an issue but because most solutions have multiple, valid approaches. Some of them can have flaws that may not be immediately obvious. In some cases, all options have flaws, and it is a matter of choosing the path with the weakness that is least likely to be found in the wild. The experienced developer will unconsciously be aware of these potential problems and, when presented with the issue in production, will instantly recognize it. In most cases, these things will surface and be addressed in QA before they reach production. By the nature of production systems (where users are always more inventive than the best QA analyst), the application will encounter something that was not anticipated.

Once in production, the key to identifying the cause of the problem is to look at what is happening, whereas the person with deep, specific knowledge will most likely first look for what is expected to happen. Therein lies the trap. If a reasonable QA effort was put in before release, it is the unexpected that is more likely to be the issue. The easiest way to find an issue that isn’t immediately obvious is to have no expectations and instead observe what the behavior is and trace it back to its origin with no anticipation of what will be found. Finding the root cause is much more about applying a way of thinking than it is about knowing something in advance.

There is also a psychological aspect to having the original developer investigate the issue. For reasons that could fill another article (if not a whole book), the first thing the developer tends to look for is something outside their application as the cause. It is quite possible it is something from outside causing the issue. The more experience the developer has, the more likely this is the case. In troubleshooting, the goal is to fix the problem, and having any assumptions at the start can delay finding the problem wherever it is. Yes, sometimes those intuitive assumptions are useful, so long as they are abandoned if they don’t quickly prove out.

When an issue is determined to be outside the responsibility of the person or team investigating, the mistake most often made is to hand it off to another team before clearly understanding how the external system is causing the issue. Failure to articulate irrefutable evidence of the source of the issue before passing it on to those responsible for that part of the system can result in an unproductive back and forth between developers or teams, as each of them also expects the problem is not in their work.

Once the issue is identified, deep knowledge may still hinder resolution and will not always be necessary. I was recently asked to help with an issue where the production support team followed a recommendation from the cloud platform vendor’s support to address an issue with throttling by moving the offending process on premises in a hybrid solution. While platform support knows its platform well, with the myriad ways it can be implemented it is just not possible to always anticipate how combinations will work out. The support team followed the advice without thinking about why that process was deployed to the cloud to begin with. The change resulted in new issues because there were insufficient resources in the on-premises server. Furthermore, when validating the change, the team only looked at the cloud monitoring (where the problem originally manifested). The failure point had been moved to the on-premises system, and it was the business that reported the new manifestation of the problem (and brought me in to help).

The final solution was to manage the iterations in the process being throttled to bring it within threshold limits. This required no knowledge of the cloud platform beyond the fact that throttling was a factor, and no detailed knowledge of the specific implementation, because the logs clearly pointed to where the failure was occurring, which was the point where a counter needed to be added to stay under the threshold.
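For readers who want a concrete picture of what a counter of this kind can look like, here is a minimal sketch in Python. It is not the code from the engagement described above; the function name `process_within_limit`, its parameters, and the fixed-window approach are all assumptions made for illustration, and the real threshold values would come from whatever limits the platform documents.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def process_within_limit(
    items: Iterable[T],
    handle: Callable[[T], None],
    max_calls: int,
    window_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run handle(item) for each item, pausing when the counter hits max_calls.

    Hypothetical helper: counts iterations in a fixed time window and waits
    for the window to end before continuing. Returns the number of pauses.
    """
    if max_calls < 1:
        raise ValueError("max_calls must be at least 1")

    pauses = 0
    calls_in_window = 0
    window_start = clock()

    for item in items:
        elapsed = clock() - window_start
        if elapsed >= window_seconds:
            # The window has already expired, so start a fresh count.
            window_start = clock()
            calls_in_window = 0
        elif calls_in_window >= max_calls:
            # Counter reached the limit: wait out the rest of the window.
            sleep(window_seconds - elapsed)
            pauses += 1
            window_start = clock()
            calls_in_window = 0

        handle(item)
        calls_in_window += 1

    return pauses
```

A call such as `process_within_limit(records, send_record, max_calls=100, window_seconds=60)` (where `records` and `send_record` are placeholders) would keep the loop at or below 100 iterations per window. A fixed window is the simplest choice; if the platform measures usage over a sliding window, a stricter scheme such as a token bucket may be needed.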

To sum up the lesson, the ability to suspend assumptions and ego is far more critical than specific technical knowledge for solving issues in production. During development, it is common to be stuck for a while solving a bug and to ask someone else to look at the problem with a fresh perspective. Carrying this process on into production will resolve issues faster and leave more time for working on the next cool iteration.

App Testing, Careers

by Scott Nelson

Contributor

Scott S. Nelson is a professional services technology consultant focused on leading the implementation of highly integrated solutions in the cloud, on premise, and often with a web-enabled user interface on platforms such as Informatica Cloud, Salesforce.com and Oracle WebCenter.

He has been creating business value with technology since 1991 when he came into his marketing communications office one morning to find a pile of boxed components and a note from the company president saying “Please set up.” After all six 486 machines were networked and online at a brisk 2400 baud rate he configured a Microsoft Works Database to manage the sales process from lead rotation to order creation and became another “accidental IT guy.”

In the last 26 years Scott has led the way in finding applications of technology to improve business processes and profitability. He has worked with companies as small as one person and as large as 200,000 employees. He has been an independent consultant and worked for small consultancies and within large software vendor professional services groups at BEA and Oracle. He strives for every solution he contributes to make it easier for users to get things done and application support teams to go home on time.

The opinions expressed in this blog are those of Scott S Nelson and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
