Matt Prigge
Contributing Editor

Storage load balancing with vSphere 5.0

analysis
Sep 26, 2011

Storage DRS, released in vSphere 5, brings exciting new storage management features to the table, but watch for limitations

Among the many improvements in VMware vSphere 5.0 is a new means to load balance virtual machine datastores: the Storage Dynamic Resource Scheduler, or Storage DRS. Like its pre-existing compute load-balancing counterpart (plain old DRS), Storage DRS promises to make managing large virtualization environments easier by taking some of the guesswork out of virtualization storage provisioning. But proceed with caution: Storage DRS can result in unintended consequences that may prevent customers who could most benefit from it from being able to use it at all.

Storage DRS allows you to group together multiple datastores into datastore clusters. Those clusters can consist of either block-level VMFS volumes or NFS mount points — providing welcome support for NFS, which has been neglected in previous VMware storage tech releases. Once configured, virtual machines can be load balanced across all datastores in a cluster based on available capacity, datastore performance, or both. The load balancing can operate autonomously, or it can simply recommend when it thinks you should make a change and allow you to approve it — providing an easy way to determine how it will work in your environment before you let it off its leash.
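The two operating modes described above — fully automated balancing versus recommend-and-approve — can be illustrated with a toy model. Everything here is invented for illustration; these are not vSphere API objects:

```python
from dataclasses import dataclass, field

@dataclass
class DatastoreClusterSketch:
    """Toy model of Storage DRS automation modes (illustrative only)."""
    fully_automated: bool = False            # False = manual: recommend only
    pending: list = field(default_factory=list)

    def balance(self, vm, src, dst):
        """Either apply a migration immediately or queue it for approval."""
        move = (vm, src, dst)
        if self.fully_automated:
            return ("applied", move)         # the cluster migrates the VM itself
        self.pending.append(move)            # an admin reviews and approves later
        return ("recommended", move)
```

Running the cluster in the manual mode first, as the article suggests, lets you inspect the `pending` list of recommendations before trusting the automated mode.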

Storage vMotion

Just as standard DRS makes use of vMotion to move virtual machines from one host to another, Storage DRS utilizes Storage vMotion to move virtual machines from one datastore to another. As with many other components of vSphere, Storage vMotion has also seen significant changes in vSphere 5.0.

Most important, Storage vMotion now makes use of a single-pass copy combined with a kernel-level mirror driver to synchronize mid-vMotion storage writes as they’re being made. Previous versions used an iterative copy mechanism that leveraged Change Block Tracking. While this CBT-based method worked much more efficiently than previous snapshot-based mechanisms, it still could take a very long time to migrate VMs with significant I/O loads. The new implementation does a more effective job. Storage vMotion also supports migrating virtual machines that have snapshots or linked clones associated with them — a feature not found in earlier implementations.

Capacity load balancing

The most straightforward aspect of Storage DRS load balancing is its management of available storage capacity across a cluster of datastores. Based on thresholds you define as you configure the cluster, Storage DRS will attempt to keep the available space in each datastore roughly similar by recommending initial placement for new virtual machines and moving existing ones as datastores fill. This makes thin-provisioned disks much more attractive, automating the often time-consuming process of shuffling VMs around as their disks grow. It also takes much of the guesswork out of deciding exactly where to put a VM when a large number of similar datastores are available.
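A minimal sketch of that capacity logic, assuming an 80 percent space-utilization threshold (vSphere's default) and a simple "move toward the emptiest datastore" rule — the function and data shapes are invented for illustration:

```python
def capacity_recommendations(datastores, threshold=0.80):
    """Recommend (source, destination) moves for datastores whose space
    utilization has crossed the threshold. `datastores` maps a datastore
    name to a (used_gb, capacity_gb) tuple."""
    def utilization(ds):
        used, cap = datastores[ds]
        return used / cap

    target = min(datastores, key=utilization)   # datastore with most free space
    return [(ds, target)
            for ds in datastores
            if ds != target and utilization(ds) > threshold]
```

Real Storage DRS also weighs the size and growth of individual VMDKs when picking what to move; this sketch only captures the threshold trigger.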

Performance load balancing and SIOC

Load balancing based on performance is substantially more complex than the fairly simple capacity-based load balancing. Fortunately, VMware has done a relatively good job of allowing you to modify many of the parameters that influence how the performance load balancing works and to see the results in the form of latency and throughput metrics. However, understanding what it's doing requires a solid grasp of the underlying mechanisms, which are largely built on VMware's Storage I/O Control, or SIOC.

Making its first appearance in vSphere 4.1, SIOC allows you to set storage priorities on individual virtual disks, which influence how much transactional storage throughput each is allowed when storage resources are constrained. Critical to that is detecting whether resources are constrained in the first place — a judgment based on whether the average latency for a given datastore has exceeded a defined latency threshold. Once that threshold is exceeded for even a few seconds, the number of I/O requests each VM is allowed to queue is constrained in proportion to its storage priority (storage “shares”). This effectively prevents a single VM from swamping a datastore with I/O and drowning out other potentially more important VMs.
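The share-proportional throttling can be sketched roughly as follows. The 30 ms congestion threshold matches SIOC's default; the function name and data shapes are invented for illustration:

```python
def sioc_queue_slots(vms, device_queue_depth, latency_ms, threshold_ms=30):
    """Rough sketch of SIOC throttling: below the congestion threshold,
    every VM may fill the device queue; above it, each VM's slice of the
    queue is apportioned by its storage shares. `vms` maps name -> shares."""
    if latency_ms <= threshold_ms:
        return {vm: device_queue_depth for vm in vms}   # unconstrained
    total_shares = sum(vms.values())
    return {vm: max(1, round(device_queue_depth * shares / total_shares))
            for vm, shares in vms.items()}
```

Note the key property the article describes: throttling only kicks in when observed latency says the datastore is actually congested, so shares cost nothing when resources are plentiful.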

Much of this same existing SIOC tech is used by Storage DRS with a few important additions. When Storage DRS is first activated, it injects a range of storage workloads onto the datastores and monitors SIOC statistics to get a rough idea of what kind of performance the datastores will be capable of under load. This provides Storage DRS with a way to recommend where to initially place a virtual machine.

After virtual machines are running, Storage DRS constantly monitors SIOC statistics to determine whether any of the datastores in the cluster are routinely latency constrained over a long period of time (16 hours by default). If they are, it will iteratively migrate virtual machines to other datastores to balance the load across all datastores.
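The contrast with SIOC's seconds-scale reaction can be sketched as a simple sustained-latency check. The 15 ms figure matches Storage DRS's default I/O latency threshold; the window fraction and function shape are invented for illustration:

```python
def needs_rebalance(latency_samples_ms, threshold_ms=15, window_frac=0.5):
    """Rough sketch of Storage DRS's slow-acting I/O balancing: flag a
    datastore only if latency exceeded the threshold for a large fraction
    of the observation window, ignoring brief spikes that SIOC handles."""
    over = sum(1 for latency in latency_samples_ms if latency > threshold_ms)
    return over / len(latency_samples_ms) >= window_frac
```

A datastore that spikes briefly stays put; one that runs hot for most of the window becomes a migration candidate.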

It’s very important to realize that virtual machine migrations based on a detected performance imbalance are much slower acting than the SIOC-based constraints. That’s primarily because migrating a virtual machine can itself produce a significant amount of storage load, which would compound an already bad situation if applied too aggressively.

The devil is in the details

At first glance, the load balancing capabilities of Storage DRS would seem to be an enormous help in just about any storage environment. For a large number of environments, that’s absolutely true. However, there are a number of significant caveats that may not be immediately apparent. They include a range of problems that arise as a result of vSphere’s lack of visibility into how the storage hardware is configured — a common problem that will probably result in significant changes to the way that storage is addressed and used in the future.

For example, since judgments on where to place and migrate VMs are based entirely on observed datastore latency, it follows that creating multiple datastores on the same virtualized disk array (or aggregate, disk group, RAID group, and so on) might have the effect of linking the transactional performance potential (and thus latency) of those datastores to each other in ways that Storage DRS can’t predict. Load balancing these kinds of datastores isn’t likely to be particularly effective. Fortunately, creating multiple volumes on a single group of physical disks has often been a workaround for the partition size limitations in VMFS3 — a practice the introduction of VMFS5 has largely made unnecessary.

Storage DRS also doesn’t have visibility into other more advanced storage array technologies such as deduplication and automated tiering. In some cases, constantly shuffling virtual machines among multiple datastores can reduce the effectiveness of those features, either by resetting block-based performance data that an array has built up by observing I/O patterns over time or by placing additional load on the inline dedupe engine used by some arrays.

However, the largest pitfall for Storage DRS can be found in environments where array-based replication is being used to implement site failover capability. Just about any storage array out there will see the Storage vMotion of a 500GB virtual machine as 500GB of new writes that need to be replicated to the other site — soaking up a massive amount of site-to-site WAN bandwidth in the process.

Even if you’re fortunate enough to have sufficient bandwidth to support that kind of replication turnover, environments using VMware’s Site Recovery Manager have even more to worry about. SRM doesn’t officially support the use of Storage vMotion on protected virtual machines since there are short windows during a migration when SRM won’t know where a virtual machine actually is — preventing adequate protection. SRM also doesn’t support vSphere Replication (a new vSphere/SRM 5.0 feature) in combination with Storage DRS. You can get some details on both of those SRM-related issues and potential workarounds from this blog post by VMware’s Cormac Hogan.

Putting it all together

Despite its current limitations, Storage DRS represents a significant step forward in making virtualization storage environments self-managing. Where it can be leveraged, Storage DRS can replace a significant number of time-consuming storage management tasks, decrease administrator error, and result in better storage utilization. It also represents a continuing effort by VMware to stay ahead of its competition and justify its relative price premium — a factor that will become even more critical as Windows Server 8 Hyper-V hits shelves in the future.

This article, “Storage load balancing with vSphere 5.0,” originally appeared at InfoWorld.com. Read more of Matt Prigge’s Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.