VMware ESX 3.5 Update 2 bug causes morning jitters

analysis
Aug 13, 2008

VMware customers woke up this morning to one of their worst nightmares: Virtual machines on their ESX or ESXi 3.5 servers with Update 2 would not power on or VMotion because of a bug in the update software.

According to early morning user reports on VMware’s forums, a bug now identified with VMware’s ESX 3.5 Update 2 patch has kept many corporate users from being able to power on their virtual machines today.

After numerous questions were posted on the forums asking for help, VMware confirmed the problem as a bug in the recently released Update 2 software for ESX 3.5 and ESXi 3.5 hypervisors.

The flaw caused general alarm in the community and revived a fear that many people have voiced over the years about server virtualization: “If I have multiple virtual machines running on a single host server, if something happens to that one host server, I now have a problem with many machines and not just one.” For years, VMware has touted the stability of its ESX hypervisor, in many cases showing screenshots of ESX host servers with an uptime of more than a year.

However, today many corporate users woke up to one of their worst virtualization nightmares: not being able to power on critical virtual machines even though the host server seemed perfectly fine at first or even second glance.

VMware users were quick to help each other diagnose the problem. They determined that a glitch in the licensing code was causing the system to incorrectly treat the host server’s license as expired. As a result, the system could no longer power on virtual machines or migrate them to another host.

VMware made an official response to the problem, saying, “An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.”
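To illustrate the kind of failure VMware describes, here is a minimal sketch of a date-based build timeout. It is not VMware's code: the function name, the `timeout_enabled` flag, and the exact cutoff logic are assumptions made for illustration, and only the August 12, 2008 date comes from the company's statement.

```python
from datetime import date

# Assumed cutoff for illustration; VMware's statement says failures
# began on August 12, 2008.
BUILD_EXPIRY = date(2008, 8, 12)


def can_power_on(today: date, timeout_enabled: bool) -> bool:
    """Toy check: refuse power-on once an enabled build timeout has passed."""
    if not timeout_enabled:
        # A release build would normally ship with the timeout disabled.
        return True
    # With the timeout left enabled, the outcome depends only on the clock.
    return today < BUILD_EXPIRY


# Hypothetical usage: the same host behaves differently from one day to the next.
print(can_power_on(date(2008, 8, 11), timeout_enabled=True))
print(can_power_on(date(2008, 8, 12), timeout_enabled=True))
```

In a check of this shape, nothing about the host itself changes when the failure begins; only the date does. That would explain why the servers looked healthy while refusing to start virtual machines.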

VMware has since taken down the Update 2 bits from the Web site to keep others from downloading and installing the problematic code. The company has identified the problem, and VMware engineers “are working around the clock to deliver updated builds and patches for impacted customers.”

VMware suggested the following workarounds:

1) Do not install ESX 3.5 U2 if it has been downloaded from VMware’s website or elsewhere prior to August 12, 2008.

2) Set the host time to a date prior to August 12, 2008. This workaround has a number of very serious side effects that could impact production environments. Any Virtual Machines that sync time with the ESX host and serve time-sensitive applications would be broken. These include, but are not limited to, database servers, mail servers, and domain administration systems.

The company is working to produce an express patch for impacted customers today, and it plans to reissue upgrade media by 6 p.m. PST on Aug. 13 and all update patch bundles later in the week.

VMware also offered the following apology in its official response:

“We are making improvements on all fronts. The product team had endeavored to deliver a release which support customers deem important. But we fell short and we are deeply sorry about all the disruption and inconveniences we have caused. We have identified where the holes are and they will be addressed to restore customers’ confidence.”

VMware has not said how many customers were affected. But the problem could be seen as fairly widespread considering the number of people who have viewed the problem discussion on VMware’s forum — almost 26,500 people as of this evening, with 500 replies to the original message. VMware technical support was inundated with calls, and even the KB article about the problem experienced time-out problems because of the heavy traffic to the knowledgebase.