Matt Prigge
Contributing Editor

VMware snapshots: The good and the bad

analysis
Aug 16, 2010 · 7 mins

VMware's virtual machine snapshot technology is exceptionally helpful, but it does have serious limitations

Disk snapshots are one of the best data-protection mechanisms available in the data center today. Whether it’s a SAN-based snapshot technology or one built into your file system or operating system, maintaining a solid snapshot regimen can really save your bacon when things go wrong. However, not all snapshot technologies are built to serve as direct hedges against unintentional data loss. A great example of this is the snapshot technology built into EMC VMware’s vSphere ESX server and desktop virtualization platform.

Unlike other snapshot technologies, VMware’s snapshots are not well suited to serve as a data-protection mechanism. Trying to use them that way (creating and maintaining periodic snapshots of a production server virtual machine, for example) will result in terrible I/O performance and a significantly increased likelihood that the snapshots themselves will become a risk to production. However, that’s certainly not to say that this snapshot technology is without purpose. VMware’s snapshot capability is an exceptionally useful tool; you just have to know where and how to use it.

Essentially, there are two things these snapshots are really good for: 1) allowing a virtual machine’s disks to be isolated from write activity so that they can be backed up, and 2) providing a short-term fallback during patching and software upgrades. In development environments, you’ll see snapshots used extensively to maintain many point-in-time images as changes are made. However, using snapshots in this manner in a performance-sensitive production environment is rarely advisable.

The reasons for this have their foundation in the nitty-gritty of what VMware’s snapshot technology actually does. Understanding this is the key to being able to use them effectively.

In a typical environment, a virtual machine’s disk resources consist of large files called VMDK files (a loose acronym for virtual machine disk) that are located on a VMware-proprietary VMFS file system or NFS store. Although there are several ways to provision these files, they are generally equal in size to the disk resource you’re presenting to the virtual machine. So, if you have a Windows 2003 server with a 15GB system drive, you’d expect to find a corresponding 15GB VMDK file hanging out on a VMFS volume.

When you ask the hypervisor to take a snapshot of the virtual machine using that disk, it will create a second VMDK file (sometimes called a redo log) next to the first. That file stores all the writes that the virtual machine makes after the snapshot is taken. Having this second VMDK file absorbing the writes for the virtual machine has several important implications.

Primarily, because the snapshot VMDK is absorbing the writes for the base VMDK, the base VMDK isn’t actively being changed. This means that if something goes wrong after the snapshot is taken, you can effectively throw it away and “revert” to the base disk — starting over where you were before the snapshot.

Similarly, because the base VMDK isn’t being written to while the snapshot exists, it’s safe to make a copy of it and be confident that you’re getting a consistent copy of what the base disk looked like before the snapshot was created. This is a critical part of nearly every third-party backup software’s ability to obtain direct-from-SAN backups of VMware virtual machines.
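To make those two properties concrete, here is a minimal toy model in Python. It is a sketch only: the `SnapshottedDisk` class, its method names, and the block contents are all invented for illustration, and reverting is simplified to dropping the delta, which is not a description of how ESX manages its files on disk.

```python
class SnapshottedDisk:
    """Toy model of a base disk plus one snapshot delta (hypothetical, not VMware code)."""

    def __init__(self, base_blocks):
        self.base = dict(base_blocks)  # block number -> data; untouched while a snapshot exists
        self.delta = None              # None means no snapshot is active

    def take_snapshot(self):
        self.delta = {}

    def write(self, block, data):
        if self.delta is not None:
            self.delta[block] = data   # the snapshot file absorbs the write
        else:
            self.base[block] = data

    def read(self, block):
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base.get(block)

    def revert(self):
        """Throw away everything written since the snapshot was taken."""
        self.delta = None


disk = SnapshottedDisk({0: "boot", 1: "app v1"})
disk.take_snapshot()
disk.write(1, "app v2 (bad patch)")

backup = dict(disk.base)       # copying the base is safe: writes are going to the delta
after_patch = disk.read(1)

disk.revert()
after_revert = disk.read(1)
print(after_patch, "->", after_revert)
```

The backup taken while the snapshot is active contains "app v1", the state of the disk at the moment the snapshot was created, even though the virtual machine has since written a newer version of that block.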

However, there are a few downsides to this that are equally important to recognize.

For instance, when a virtual machine performs a read operation while a snapshot is present, the hypervisor must check the snapshot VMDK to see if the requested block has been changed and is present there. If it has, it will provide the virtual machine with the block from the snapshot. If it hasn’t, it will read it from the base VMDK. With just one snapshot, this isn’t a huge deal — at worst, duplicating read operations for blocks that haven’t been changed. However, if you apply a large number of snapshots — say one a day for two weeks — the hypervisor could need to check 14 snapshot files for the requested block. This constitutes an enormous read I/O penalty and is one reason why these snapshots can’t effectively be used as a standalone data-protection mechanism in the way that SAN snapshots can.
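The read penalty described above can be sketched in a few lines of Python. This follows the simplified model in this article, where each snapshot file is checked in turn from newest to oldest; the `read_block` function and the sample data are hypothetical, and a real hypervisor may use metadata to shorten the search.

```python
def read_block(block, base, deltas):
    """Return (data, files_checked), searching the newest snapshot first."""
    checked = 0
    for delta in reversed(deltas):
        checked += 1
        if block in delta:
            return delta[block], checked
    checked += 1  # not in any snapshot, so fall through to the base disk
    return base[block], checked


base = {n: f"base-{n}" for n in range(100)}
# One snapshot a day for two weeks; each day's delta changes a single block.
deltas = [{day: f"day{day}-change"} for day in range(14)]

data_unchanged, cost_unchanged = read_block(99, base, deltas)
data_recent, cost_recent = read_block(13, base, deltas)
print(cost_unchanged, cost_recent)
```

In this model a block that has never changed costs 15 lookups (14 snapshot files plus the base disk), while a block written since the most recent snapshot costs one.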

Worse still, there is no comprehensive mechanism to prevent these snapshots from growing out of control. If you took a snapshot of that Windows 2003 server once a day for two weeks on a machine with an average disk turnover of 2GB per day (which is highly conservative for most virtual machines), you’d retain roughly 28GB of changes on top of the 15GB base disk, nearly tripling the amount of space the virtual machine was using.

That’s not so bad in itself: almost any snapshot technology, SAN or otherwise, consumes snapshot space in a similar manner. What’s different is that there isn’t any automatic protection in place to prevent you from filling the VMFS volume that these disks live on (most SANs delete the oldest snapshot when the pre-allocated snapshot space has been consumed). In the event that the VMFS volume does fill, the current snapshot file won’t be able to accept writes, and the virtual machine will promptly crash. In an instance where you have a large number of virtual machines with snapshots all sharing the same full VMFS volume, they’ll all crash. Not a good discovery to make over your morning coffee.
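Because nothing stops the volume from filling, it is worth doing the arithmetic ahead of time. The sketch below is a back-of-the-envelope projection, not a monitoring tool: the `days_until_full` helper, the 100GB of free space, and the count of ten virtual machines are assumptions chosen for illustration, while the 2GB-per-day figure comes from the example above.

```python
def days_until_full(free_gb, daily_growth_gb_per_vm, vm_count):
    """Whole days of snapshot growth a volume can absorb before it fills."""
    total_daily_growth = daily_growth_gb_per_vm * vm_count
    if total_daily_growth <= 0:
        return None  # no growth, so the volume never fills from snapshots
    return int(free_gb // total_daily_growth)


# Hypothetical volume: 100GB free, shared by 10 VMs that each keep snapshots.
days = days_until_full(free_gb=100, daily_growth_gb_per_vm=2, vm_count=10)
print(f"Volume fills in about {days} days")
```

With these assumed numbers the shared volume has less than a week of headroom, well short of a two-week snapshot retention window.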

A third very common potential pitfall is related to the hypervisor’s attempt to make sure that the snapshot is application consistent, a process called disk quiescing. Although you can take a snapshot without quiescing the disks first, the result is that you’ll have what’s commonly referred to as a “crash-consistent” snapshot. If you revert to the base disk after a crash-consistent snapshot has been taken, the operating system and applications will behave as if the machine had been unceremoniously shut off mid-operation. Quiescing the disk gives the operating system and applications a chance to find a good stopping point (finish writes, complete transactions, etc.) just before the snapshot is taken.
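The difference between the two kinds of snapshot can be illustrated with a toy model. Everything here is hypothetical (the `App` class, the `snapshot` function, and the order records); it stands in for an application that buffers work in memory and does not represent any VMware or operating system interface.

```python
class App:
    """Toy application that buffers records in memory before writing them to disk."""

    def __init__(self):
        self.disk = []     # records safely written to disk
        self.pending = []  # records still held in memory

    def submit(self, record):
        self.pending.append(record)

    def flush(self):
        self.disk.extend(self.pending)
        self.pending.clear()


def snapshot(app, quiesce):
    """Capture the on-disk state, optionally letting the app reach a stopping point first."""
    if quiesce:
        app.flush()
    return list(app.disk)


app = App()
app.submit("order-1")
app.flush()
app.submit("order-2")  # still in memory when the snapshots are taken

crash_consistent = snapshot(app, quiesce=False)
quiesced = snapshot(app, quiesce=True)
print(crash_consistent, quiesced)
```

The crash-consistent capture misses the buffered record, just as a disk would after a sudden power loss, while the quiesced capture includes it.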

Depending upon the operating system running within the virtual machine, disk quiescing can be accomplished in a few different ways. All of them require the VMware Tools package, which is commonly installed within the virtual machine’s operating system. If you’re running a recent Windows OS, VMware Tools tries to use Microsoft’s Volume Shadow Copy Service (commonly shortened to VSS) to create a temporary pause in the virtual machine’s I/O so that VMware can create its own underlying disk snapshot.

When this works, it’s excellent. However, in virtual machines with a large volume of transactional disk I/O or a database platform such as Microsoft SQL Server or Exchange, VSS can often require a significant amount of troubleshooting to get working properly. Worse still, VSS can work fine one day and fail the next, which can be frustrating when you’re depending on quiesced snapshots to get good backups of the virtual machine.

Despite the downsides, VMware snapshots can be an excellent tool when used properly — either as a short-term point-in-time copy before a major upgrade or patch cycle or as a means to allow third-party backup software to perform direct-from-SAN backups of virtual machines.

However, using them effectively requires a good understanding of how they work and can occasionally require some troubleshooting. Just be sure to go in with both eyes open and test thoroughly before you depend on VMware snapshots in mission-critical virtual machines.

This article, “VMware snapshots: The good and the bad,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in network storage and information management at InfoWorld.com.