by Ernest Mueller

Know your options for infrastructure monitoring

Jul 2, 2014 | 9 min read

As the number and variety of monitoring tools grow, it's getting harder to track which metrics to collect and why. Ernest Mueller of CopperEgg walks you through the decision points.

Network monitoring is the nervous system of any infrastructure. Keeping tabs on your services — whether they’re local or in the cloud — is vitally important to maintaining a stable and functional service infrastructure.

In this week’s New Tech Forum, Ernest Mueller, product manager at infrastructure-monitoring provider CopperEgg, walks us through the growing field of network and service monitoring, and what we can now leverage to keep tabs on our ever more complex infrastructures. — Paul Venezia

If an app slows in Singapore, does ops in Palo Alto know why?

One of the more fascinating points about multimillion-dollar IT systems businesses is that most of their operations are invisible to the naked eye. You may be sitting in a server room, but still have no idea what workloads are being performed, whether services are up or down, or what performance is like.

That’s why we rely on instrumentation — the monitoring tools that offer a deep view of what’s going on inside this hidden world.

There are many different ways to instrument your applications and systems to get a view into what they’re doing. But as with the real, physical world, what you’re observing and how you perceive the events in question affect the conclusions you derive.

As an example, with a real user-monitoring tool, you may see that a critical application is working well for your users, who are based in the United States. But a synthetic probe shows there’s an issue with access to your site from Singapore, and it will affect users in several hours when that office comes online. Neither form of monitoring is “better,” per se — they’re simply instrumenting your system in different ways and therefore showing different levels of information. To have the most comprehensive view of your site, you’d want both.

Let’s explore the many ways an organization can acquire and digest monitoring data and the implications of each.

Instrumentation techniques

The instrumentation technique is largely defined by two factors: the method by which the instrument takes a sample, and the point in the system's web of paths and dependencies where the sample is taken.

Instrumentation methods are typically divided into passive and active approaches. Passive instrumentation tries to record data about what is going on in the system without affecting it (though the observer effect is difficult to remove entirely). Active instrumentation specifically provides a stimulus to provoke a response.

Whichever approach is used, the other differentiator is depth: what type of information is captured in a sample and how rich it is. For example, a synthetic Web transaction can simply capture a response code and total time taken for the request, or it can capture an entire “waterfall” of performance of every component and how long results took to render in a real browser. A sample could be taken every minute or every hour.
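To illustrate the shallow end of that spectrum, an active probe can be as simple as timing one synthetic HTTP request and recording its response code. The sketch below is a minimal, assumption-laden example using only the Python standard library, with a throwaway local server standing in for the real service; a richer probe would capture a full component waterfall instead of a single number:

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def probe(url, timeout=5):
    """Active probe, shallow sample: one synthetic request,
    recording only the status code and total elapsed time."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        status = resp.status
    return status, time.monotonic() - start

# Throwaway local server standing in for the real service under test.
class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, elapsed = probe(f"http://127.0.0.1:{server.server_port}/")
print(status, round(elapsed, 3))
server.shutdown()
```

A real scheduler would run this every minute (or every hour) and store the samples, which is exactly the sampling-frequency decision described above.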

The instrumentation points where you collect data can be on a server, within an application, on the network, in a client browser, or at any other part of the service. Which points you choose to instrument determines what behavior you have direct data on.

For example, if you instrument the JVM (Java virtual machine) that an application runs in, you know specifically what that JVM is doing — and which behaviors you can only infer. (The JDBC call is slow, but is that because of the network, the database, the database server, or something else entirely?) Choose your points of instrumentation carefully, so you both see the high-level state of your system and its delivery to real users and can dive deeper to isolate and troubleshoot faults or slowdowns.

Available tooling

Modern IT systems are complex, so a host of tools has sprung up that can instrument specific points within those systems using a variety of methods. Here’s a breakdown of the most common tool instrumentation approaches you can apply to a typical three-tier Web system:

  1. Browser RUM (real user monitoring) uses JavaScript instrumentation embedded in Web pages that samples information about the user experience. RUM can capture actual user experience across a wide part of the service, but its scope is limited to what your users are currently doing — and it generates a lot of data. Web analytics are closely related to browser RUM. Pure API traffic isn’t captured, and mobile apps typically require a separate implementation.
  2. Global probes generate synthetic Web transactions applied to the system from various external geographic locations. These have the benefit of repeatedly testing the service in the same way from various points, and they provide great performance-over-time information that can be used effectively for measuring SLA attainment. However, they can’t exercise all parts of the service, and they add load to the service in the process.
  3. Network RUM is based on network capture of user traffic on the server side. It requires physical access to the network, and it doesn’t see activity served from browser or CDN caches. However, it can see more protocols than browser RUM and doesn’t suffer from browser compatibility issues.
  4. Local probes apply synthetic Web transactions to the system from inside the service network. They’re very actionable for alerting (if it fails, the service is most likely down), but do not cover the full chain required to deliver the service to an end-user. A variation of this applies a probe from onboard an individual system to services running on that system itself.
  5. Network APM (application performance monitoring) analyzes the behavior of system components by watching the interchange on the network. It covers many protocols and provides insight into network-based performance issues, but it’s blind to the complexity that lies behind that IP address.
  6. Database APM offers deep-dive analysis of database activity and performance statistics. It provides plenty of information, including database errors and performance issues, but does not expose issues across the other 90 percent of the stack. Also, support for the explosion in diversity of NoSQL/NewSQL data stores is a challenge.
  7. Network monitoring of network devices and flows is important to identify and diagnose network problems, but it usually does not take into account the higher-level operation of applications, services, and users that is more meaningful to the business.
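The local-probe idea (No. 4 above) can be sketched in a few lines: from inside the service network, try to open a TCP connection to the service port and treat failure as a strong down signal. This is a minimal assumption-laden sketch — a real probe would also exercise an application-level health check, since an open port does not prove the application is healthy. The demo uses a port we control rather than a real service:

```python
import socket

def port_is_up(host, port, timeout=2.0):
    """Local probe: can we open a TCP connection to the service port?
    A failure here is highly actionable — the service is most likely down."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: stand up a listening socket to play the role of the service.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
service_port = listener.getsockname()[1]

up_before = port_is_up("127.0.0.1", service_port)  # service listening
listener.close()
up_after = port_is_up("127.0.0.1", service_port)   # service gone

print(up_before, up_after)
```

In practice the probe loop would run on a schedule and feed an alerting system, which is what makes this class of check so useful for paging.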

Those are merely the types of monitoring external to the servers and applications themselves. By decomposing a specific system, we can observe many more instrumentation techniques.

Let’s use the common pattern of a server (one of a cluster, most likely) with a Web server fronting a Java application server running several JVMs, each of which contains a couple of Java applications:

  1. Software platform metrics: Simple process uptime monitoring is the most basic method, but most Web servers, app servers, and other third-party components also surface metrics about their operation via a status page or other means. These provide another data point on uptime and performance, which helps isolate issues, but you’re limited to the specific metrics the software vendor chose to expose.
  2. App container metrics: Typically, these are Java JVM metrics via JMX or code instrumentation (or similar metrics on other platforms). These deliver excellent depth to find application issues at runtime, but there are thousands of fine-grained metrics that require some sophistication to understand.
  3. Application metrics: These surface from inside the application itself using a metrics library. They’re very valuable because they are custom to the exact data you want to surface, whether it’s dollars sold in your online store or number of customers served — but your developers need to write code explicitly to surface them.
  4. Hardware platform metrics: Here we’re talking about OS-level metrics (the ever popular CPU/memory/disk), the underlying abstraction layer, if any (for example, Amazon AWS metrics pulled using CloudWatch, virtualization layer metrics, or LXC container metrics for Docker users), and hardware metrics. They’re necessary to identify resource shortfalls and provide insight into many common issues, but may or may not be representative of the service experience in the real world.
  5. Network metrics: These metrics are gathered by sniffing the interface on each server. Otherwise, they’re similar to the network APM technique discussed above.
  6. Log aggregation: All of the above parts of the system usually dump records of events and metrics into log files, which are an alternate path to gather much of the same information. Log information is often richer than pure metrics, but it’s also large in volume and often slower to collect and process for rapid information.
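To make the application-metrics idea (No. 3 above) concrete, here is a minimal, hypothetical in-process metrics registry. Real deployments would use an established metrics library, but the shape is the same: the application increments named counters from its own code — including business-level numbers like dollars sold — and a collector periodically reads a snapshot. The metric names below are illustrative only:

```python
import threading
from collections import defaultdict

class Metrics:
    """Minimal in-process metrics registry the application writes to directly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def incr(self, name, value=1):
        """Increment a named counter (thread-safe)."""
        with self._lock:
            self._counters[name] += value

    def snapshot(self):
        """Return a point-in-time copy for a collector to ship off-box."""
        with self._lock:
            return dict(self._counters)

metrics = Metrics()

# Application code surfaces whatever is meaningful to the business:
metrics.incr("orders.placed")
metrics.incr("revenue.cents", 4999)
metrics.incr("orders.placed")

print(metrics.snapshot())
```

The cost, as noted above, is that developers must write these calls explicitly — nothing is collected that the code doesn’t surface itself.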

Because each type of instrumentation has its strengths and weaknesses, the challenge is to layer in types of monitoring that both provide proactive awareness of the end-user experience, minimizing mean time to detection (MTTD), and offer sufficient granularity and information to minimize mean time to resolution (MTTR) after an issue arises.

In pursuit of optimum results, it’s tempting to want one of each. But opting for every form of instrumentation raises cost — including licenses, labor, complexity of management, and the additional system load from active methods — to impractical levels. One solution is to adopt an integrated solution, such as that offered by CopperEgg, which wraps a mix of monitoring services into one complete portfolio at lower cost.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

This article, “Know your options for infrastructure monitoring,” was originally published at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.