VMware VDC-OS to bring fault tolerance to VMs

analysis
Sep 18, 2008 (4 mins)

Marathon Technologies, on the other hand, already has such a fault-tolerance solution available for XenServer

At VMworld 2008, VMware announced a new fault-tolerance feature that will be included in its new Virtual Datacenter Operating System (VDC-OS).

VMware said that “Fault Tolerance is a groundbreaking new application service that provides zero downtime, zero data loss, availability to all applications against hardware failures without the cost and complexity of hardware or software clustering solutions.”

Raghu Raghuram, VMware vice president of products and solutions, explained that VMware Fault Tolerance pairs two VMs, a primary and a secondary. Using record/replay, VMware FT keeps the two VMs in lockstep: the secondary VM mirrors the primary instruction for instruction, but runs on a different physical server.

In the event of a hardware failure, the secondary VM takes over for the primary VM and assumes all of its network connections. When this happens, there shouldn’t be any interruption in the virtual machine’s network connectivity, much as with VMotion.
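
To make the record/replay idea concrete, here is a minimal toy sketch in Python. It is not VMware's implementation, and every name in it (`ToyVM`, `LockstepPair`, `handle_input`, `fail_primary`) is hypothetical. It only assumes what the description above implies: if a machine's state depends solely on its inputs, then logging those inputs and replaying them elsewhere yields an identical copy that can take over.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical deterministic machine: state depends only on inputs applied."""
    name: str
    state: int = 0
    applied: int = 0  # number of logged inputs applied so far

    def apply(self, event: int) -> None:
        # Same inputs in the same order always produce the same state.
        self.state = (self.state * 31 + event) % 1_000_003
        self.applied += 1


@dataclass
class LockstepPair:
    """Hypothetical primary/secondary pair kept in step through an input log."""
    primary: ToyVM
    secondary: ToyVM
    log: list = field(default_factory=list)
    active: str = "primary"

    def handle_input(self, event: int) -> None:
        # Record the input, run it on the primary, then replay it on the secondary.
        self.log.append(event)
        self.primary.apply(event)
        self.secondary.apply(event)

    def fail_primary(self) -> ToyVM:
        # Replay anything the secondary has not applied yet, then promote it.
        for event in self.log[self.secondary.applied:]:
            self.secondary.apply(event)
        self.active = "secondary"
        return self.secondary


pair = LockstepPair(ToyVM("primary"), ToyVM("secondary"))
for event in [7, 42, 3]:
    pair.handle_input(event)

state_before_failure = pair.primary.state
survivor = pair.fail_primary()
print(survivor.name, survivor.state == state_before_failure)
```

In this toy model the survivor ends up in exactly the state the primary had at the moment of failure. A real hypervisor also has to capture interrupts, timing and device I/O, none of which is modeled here.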

This functionality may sound familiar to many of you. Remember back to VMworld 2007 in San Francisco during Mendel Rosenblum’s keynote presentation where he spoke about the future of “Continuous Availability”? Well, the future isn’t quite here yet, but it does seem closer to reality now. And it even got a new name.

Matthew Clark, director of computing infrastructure, Qualcomm, has already started using the new feature. He said, “VMware Fault Tolerance is making previously very high-end, expensive and complex functionality truly accessible. It can be turned on with a mouse click and you get the same result that fault tolerant hardware or software clustering would give you at a fraction of the cost and effort.”

Again, this technology may sound very familiar to you. No, not just because Dr. Rosenblum talked about it a year ago, but because someone is already selling this technology on top of another virtualization platform. Marathon Technologies has been providing this type of technology for a while now on top of the Xen platform. In fact, almost a year ago to the day at VMworld 2007, Marathon was showcasing its fault-tolerant lockstepping product, everRun, for virtual environments.

In response to VMware’s announcement around fault tolerance and lockstepping, Brian Mullins, senior director of corporate communications at Marathon Technologies, said on his corporate blog that what he saw wasn’t bad for FT rookies. He described it as “a less than perfect solution for a lot companies that want to run business critical and mission-critical applications in VMs.”

Mullins said that VMware’s solution falls short and gave four primary reasons why.

“The most common failures that result in unplanned downtime are component failures such as storage, NIC or controller failures. Yet VMware Fault Tolerance doesn’t do anything to protect against I/O, storage or network failures. By not addressing these primary sources of failures, VMware appears to be saying that you/the customer are on your own to figure out how to protect your storage and network connections. This may be okay for the very largest IT staffs in the world, but for the other 98%; it will not be sufficient.”

And then there is a bit of complexity involved. Mullins stated, “In order to use VMware Fault Tolerance, you’ll first have to install both VMware HA and DRS. No small feat in and of themselves. Then, because VMware FT requires NIC teaming, you’ll also have to manually install paired NICs. Then you’ll need to manually set up dual storage controllers (with the software to manage them) because it requires multipathing. And to top it all off, you’re required to use an expensive, and often complicated, SAN.”

Mullins also said there was an issue with limited CPU fault tolerance. “With VMware FT, you’ll need to set up what VMware refers to as a ‘record/replay’ capability on both a primary and secondary server. If something happens to the primary server, the record is stored on the SAN and then restarted on the secondary server. Two things to point out here. First, the whole thing depends on the quality of the SAN. Second, in the words of the VMware engineer who presented at VMworld, ‘this can take a couple of seconds.’ So what happens to your application state in those couple of seconds?”

And finally, Mullins brings up the limitation that VMware FT will only work in VMware environments, and it can’t be used for business-critical and mission-critical applications that are kept on physical server platforms.

Marathon doesn’t appear to support VMware, but it does support Citrix XenServer. And the product is available today.