Matt Prigge
Contributing Editor

Antivirus: The silent virtualization killer

analysis
Feb 27, 2012 | 7 mins

Traditional guest-based antivirus can sap the life out of your virtualization and storage infrastructure -- learn how to recognize the problem

Life in IT is full of onerous tasks. Along with making good backups and maintaining a solid patching regimen, you must ensure that multiple levels of antimalware software are properly deployed. Unfortunately, in heavily virtualized environments, antivirus can go beyond being a pain to manage and actually become a threat in and of itself. As the saying goes, sometimes the cure is worse than the disease.

That antivirus software can slow down a machine probably comes as no surprise to anyone. Any software that watches each and every disk I/O and inspects it for threats adds overhead that didn’t previously exist. In most cases, this manifests itself through marginally higher disk latency and greater CPU load. But with careful use of scanning exclusions (for heavily used databases and the like), it’s usually not enough to bring a system to its knees.

Recently, however, I’ve been presented with two excellent examples of how antivirus run amok can have enormous sitewide impact — and how it can be difficult to detect the cause unless you know to look for it and have the monitoring data necessary to do so.

The new VDI environment

In the first instance, a client was in the process of bringing a new VDI environment into production. The base image had been fully tested, and the user base was excited to get rid of their ancient desktops and take advantage of the session portability that VDI would give them. Initial user testing had gone well, and no problems were detected.

However, as larger numbers of desktops were automatically deployed and the user count expanded, performance started to suffer. At first, things were a bit sluggish for everyone, but as the rollout proceeded, performance became dramatically worse, to the point that users eventually started to miss those old desktops. Initial investigation on the virtualization hosts didn’t show any significant CPU or memory contention, so attention quickly turned to the SAN.

A look through the management interface of the SAN made it immediately clear that the problem was indeed storage related, with latencies peaking well above 20ms. As is often the case in such situations, the fear that the SAN wasn’t up to the job of serving a VDI environment started to build.

Fortunately, the troubleshooting process didn’t stop there. Further investigation into the SAN load revealed that each and every VDI desktop was generating an average of 40 IOPS for about an hour after booting, far outside of the norm and much higher than what had been seen during initial image development and testing.
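To see why 40 IOPS per desktop matters, it helps to multiply it out. The short Python sketch below does that arithmetic. Only the 40 IOPS figure comes from the incident; the desktop counts are hypothetical, since the size of the rollout isn't stated here, and the calculation assumes every desktop is in its post-boot scan at the same time.

```python
def aggregate_iops(desktops: int, iops_per_desktop: float = 40.0) -> float:
    """Total IOPS the shared storage sees if every desktop scans at once."""
    return desktops * iops_per_desktop


if __name__ == "__main__":
    # Hypothetical rollout sizes, chosen only to show how the load scales.
    for count in (50, 250, 1000):
        print(f"{count:>5} desktops -> {aggregate_iops(count):>8.0f} IOPS")
```

A load that looks harmless on a single test desktop grows linearly with every desktop added, which would explain why the problem only surfaced as the rollout expanded.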

Eventually, it was determined that as the nonpersistent desktop images booted, their antivirus agent was sourcing new virus definitions from the management server, then performing a full system scan to ensure that no newly detectable risks had gotten through. This is a commonly used and perfectly reasonable approach in a physical desktop environment, but in a virtualization environment, it resulted in nothing short of the brutal murder of the underlying shared storage hardware.

As a test, deployments of new AV definitions were disabled (on the AV platform in use, it was impossible to disable the automatic scan). The result was a tenfold decrease in disk I/O during the early morning hours as new users logged in and new desktops were deployed — effectively working around the problem, although it left the task of manually updating the signatures in its wake.

The crashed SQL cluster

In another instance, a client reported that a mission-critical application had become unresponsive. Initial troubleshooting showed that the highly redundant clustered database services had actually gone offline. After the services had been restarted and service restored, investigation of the SQL server logs showed that the database services had experienced a period of extremely high disk latency, which culminated in the service giving up and terminating.

Inspection of the SAN itself showed no unusual events or failures. It seemed at first that the problem might be isolated to the database cluster, but since the issue was no longer occurring, it was difficult to figure out what had actually happened. Worse, historical performance logging wasn’t available from the SAN infrastructure, so further investigation of the incident from the SAN’s perspective wasn’t possible.

Since the physical database cluster shared physical disk resources with a large virtualization infrastructure (which had extensive performance reporting capabilities), attention turned there to see if the virtualization environment also saw higher latency during the failure. Indeed it had — with incredible latency spikes at precisely the same time.

That wasn’t all. Not only had the virtualization environment seen the same latency, it was also generating approximately 200MBps of I/O per host across a cluster of eight hosts — an I/O load well in excess of what the storage back end could handle gracefully. Further investigation showed about half of the 200 production VMs started directing massive amounts of I/O at the SAN at precisely the same time and finished doing so about 15 minutes later.
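The scale of that load is easier to appreciate as a quick calculation. The Python sketch below uses only the figures above (eight hosts, roughly 200MBps each, about half of 200 VMs). Attributing all of the I/O to the scanning VMs is a simplifying assumption made for illustration, and the function name is hypothetical.

```python
def storm_throughput(hosts: int, mbps_per_host: float, scanning_vms: int):
    """Return (aggregate MBps hitting the SAN, average MBps per scanning VM)."""
    if scanning_vms <= 0:
        raise ValueError("scanning_vms must be positive")
    aggregate = hosts * mbps_per_host
    # Assumption: every byte of the measured load came from a scanning VM.
    per_vm = aggregate / scanning_vms
    return aggregate, per_vm


if __name__ == "__main__":
    aggregate, per_vm = storm_throughput(hosts=8, mbps_per_host=200, scanning_vms=100)
    print(f"aggregate: {aggregate:.0f} MBps, per scanning VM: {per_vm:.1f} MBps")
```

No single VM needs to do anything extreme under these assumptions; the damage comes from all of them doing it at the same moment.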

In the end, it turned out that a configuration problem on the antivirus management server had resulted in a large number of the VMs reverting to an unmanaged state in which they would source their own antivirus definitions and, as in the VDI example above, perform an immediate system scan upon applying them. The load produced by 100 virtual machines on high-performance virtualization hosts all thrashing their disks as fast as they could was more than enough to bring the SAN to its knees.

Solutions exist

Fortunately, there are good solutions to these problems. Chief among them is migrating to an antivirus platform that can run within the hypervisor stack rather than within the guest operating system. Though there are now several products in the market, the first to leverage the VMware Endpoint Security (EPSEC) APIs present in vSphere was Trend Micro with its Deep Security product. These systems work by integrating a scanning engine into the disk I/O pathway within the hypervisor and managing it from a centralized virtual appliance that controls policies such as signature updates.

This approach results in lower I/O latency and host CPU load, and it handily avoids the types of antivirus-induced disk storms that were central to both of the incidents I outlined. While a guest antivirus agent operates oblivious to the fact that many other similar agents might be running next to it on the same host, hypervisor-based agents can scan the VMs serially, ensuring that antivirus storms aren’t created and overall security is maintained.

Even if you can’t afford to run out and replace your antivirus platform, you should learn to recognize what an antivirus storm looks like — and the kinds of effects it can have on your virtualization and storage infrastructures. Though a hypervisor-integrated solution will yield the best results, careful management of centralized antivirus software policies and schedules can avoid some of the worst effects.
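One way to apply that kind of schedule management is to stagger scan start times so that only a bounded number of guests scan at once. The Python sketch below is illustrative only: `staggered_starts`, the VM names, and the 15-minute scan length are hypothetical, and how start offsets are actually configured depends on the antivirus management console in use.

```python
def staggered_starts(vm_names, scan_minutes, max_concurrent=1):
    """Assign each VM a start offset in minutes so that no more than
    max_concurrent scans overlap, assuming each scan takes scan_minutes."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    schedule = {}
    for index, name in enumerate(vm_names):
        # VMs are grouped into consecutive slots of max_concurrent each;
        # a slot starts only after the previous slot's scans have finished.
        slot = index // max_concurrent
        schedule[name] = slot * scan_minutes
    return schedule


if __name__ == "__main__":
    vms = [f"vm{n:02d}" for n in range(1, 7)]  # hypothetical VM names
    for name, offset in staggered_starts(vms, scan_minutes=15, max_concurrent=2).items():
        print(f"{name}: start at +{offset} min")
```

Setting `max_concurrent=1` approximates the serial scanning that hypervisor-based products perform, at the cost of a longer overall scan window.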

Perhaps the most important takeaway from both experiences is that without good historical monitoring capabilities, it’s almost impossible to differentiate between a slow SAN and an unusual disk load brought about by antivirus software (or anything else). If you do nothing else, make sure you provide yourself with as many sources of monitoring data for as much of your infrastructure as you can.
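As a rough illustration of what that monitoring data makes possible, the Python sketch below flags intervals in which a large share of VMs spike well above their own baseline at the same time, which is the signature both incidents shared. The data layout, the function name, and the thresholds (five times the median, 40 percent of VMs) are assumptions made for the example, not values taken from any monitoring product.

```python
from statistics import median


def find_io_storms(samples, spike_factor=5.0, min_fraction=0.4):
    """Return indices of intervals where at least min_fraction of VMs
    exceed spike_factor times their own median IOPS.

    samples: dict of VM name -> list of IOPS readings, one per interval,
    with every list the same length.
    """
    if not samples:
        return []
    baselines = {vm: median(series) for vm, series in samples.items()}
    intervals = len(next(iter(samples.values())))
    storms = []
    for t in range(intervals):
        spiking = sum(
            1
            for vm, series in samples.items()
            if baselines[vm] > 0 and series[t] > spike_factor * baselines[vm]
        )
        if spiking / len(samples) >= min_fraction:
            storms.append(t)
    return storms


if __name__ == "__main__":
    # Hypothetical readings: three VMs spike together in interval 3,
    # while vm3 spikes alone in interval 5.
    readings = {
        "vm1": [10, 12, 11, 400, 10, 9],
        "vm2": [20, 18, 22, 500, 21, 19],
        "vm3": [5, 6, 5, 6, 5, 300],
        "vm4": [15, 14, 16, 450, 15, 14],
    }
    print(find_io_storms(readings))
```

A single busy VM does not trip the check; many VMs spiking in lockstep does, and that is the pattern that distinguishes a scan storm from a storage array that is simply too slow.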

This article, “Antivirus: The silent virtualization killer,” originally appeared at InfoWorld.com.