Recently we added another host to our VMware cluster. During testing we discovered that a few virtual machines would not vMotion between any of our hosts. It’s not clear whether this had anything to do with adding the host, or whether we only noticed it because we were moving guests between hosts, which as a general rule does not happen that often.
To set the stage from the hardware side of things:
- One VMware cluster with (5) Dell hosts: (2) 2950s, (2) R720s, and (1) R730
- Shared storage on a FalconStor NSS SAN, connected via a Brocade 6505 FC switch.
- Originally we had a mix of ESXi 5.1 and 6.0, but we’ve since upgraded to all 6.0; the problem didn’t change between versions.
Fairly standard stuff.
When a vMotion would fail, we would get one of a few different errors, most of which were along the lines of:
After that, the VM would show either “(orphaned)” or “(inaccessible)” in vCenter:
Often, the same VM would reappear at the same time as a “discovered virtual machine,” which showed normally in vCenter but still could not be moved between hosts. It also showed the wrong disk space usage, often showing 4GB or less on servers with 60GB or larger drives. Note the guest OS was still running without issue, and the VM could be shut down and started back up without any errors.
We quickly established that this was not limited to a specific datastore or host, since the guest VMs that had the problem had nothing in common. We also ruled out a storage issue, as we have quite a lot of other VMs on that SAN as well as a Windows cluster, all of which were running fine.
Our first troubleshooting step was to remove the guest from vCenter inventory and attempt to re-add by browsing the datastore for the .vmx file. This resulted in two VMs being imported, as shown:
The first VM – i.e. the one without the (1) – would not show any data.
The second copy of the VM would show mostly accurate data:
Note the provisioned and used storage numbers, which are clearly incorrect.
After a while, the registration of the first copy of the VM would fail with an error and no longer show in vCenter.
The second copy of the VM would remain in vCenter but any attempt to vMotion to another host would result in an error (shown below), and the VM would then show (orphaned) in vCenter.
We were back where we started and contacted support.
Support was not terribly useful. I’m not looking to turn this into a rant about VMware support, so I’ll skip a lot of the details, but the short version is that while they did fix this for two of our VMs, they did so in an extremely inefficient manner that involved duplicating the entire VMDK file. So when the same problem appeared on two more VMs, I wanted to investigate further in-house.
Below is the fix we came up with, which is a little tedious but effective and not too time-consuming.
- SSH to one of your hosts.
- Navigate to /vmfs/volumes/datastorename/vmfoldername
- Run ls -l for a directory listing.
It’s a good time to review a few things about VMware file names:
name.vmx is the virtual machine definition file, which is plain text, and contains info about the VM itself
name.vmdk is the hard drive definition file, which is plain text as well, and points to the hard drive data file
name-flat.vmdk is the actual hard drive data
In some cases, either of the definition files could be missing (which would cause a problem), but they were both there for us.
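A quick way to confirm which of these files are present is a small shell check. This is only a minimal sketch and was not part of our original fix; `check_vm_files` is a hypothetical helper name, and the folder and base name are whatever you navigated to above.

```shell
# Hypothetical helper: report which of the three files exist for one VM.
# $1 = VM folder (e.g. /vmfs/volumes/datastorename/vmfoldername)
# $2 = base name of the VM files (the "name" in name.vmx)
check_vm_files() {
    dir=$1
    name=$2
    for f in "$name.vmx" "$name.vmdk" "$name-flat.vmdk"; do
        if [ -f "$dir/$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
        fi
    done
}
```

For example, `check_vm_files /vmfs/volumes/datastorename/vmfoldername name` prints one line per file, so a missing definition file stands out immediately.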
- Keep the SSH session open, and use WinSCP to connect to the same host, then navigate to the same folder.
- Use WinSCP to take backup copies of the base vmdk file and the vmx file.
- Open the base vmdk file and take note of the “ddb.adapterType” shown, as you will need this later. Also check to see if “ddb.thinProvisioned” is present, and its value if so.
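If you prefer to read those two values from the SSH session instead of opening the file, something like the following could work. It is a sketch that assumes the descriptor uses the usual `key = "value"` lines; `vmdk_ddb_value` is a hypothetical helper, not a VMware tool.

```shell
# Hypothetical helper: print the quoted value of one key from a plain-text
# vmdk descriptor, or print nothing if the key is absent.
# $1 = descriptor file, $2 = key such as ddb.adapterType
vmdk_ddb_value() {
    # Split each line on double quotes: field 1 is 'key = ', field 2 the value.
    awk -F'"' -v key="$2" '{
        k = $1
        sub(/[ \t]*=[ \t]*$/, "", k)   # strip the trailing " = "
        if (k == key) print $2
    }' "$1"
}
```

Calling `vmdk_ddb_value name.vmdk ddb.adapterType` and `vmdk_ddb_value name.vmdk ddb.thinProvisioned` gives you the two things to write down; an empty result for the second means the line is not present.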
- Rename the base vmdk file and the vmx file, since we’re going to replace both (shown here via SSH):
mv -i name.vmdk name.vmdk.old
mv -i name.vmx name.vmx.old
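If you would rather do the backup and the rename in one pass from the SSH session, the two steps can be sketched as a small function. The `.bak` suffix and the function name `set_aside_vm_definitions` are my own assumptions; the `.old` names match the commands above.

```shell
# Hypothetical helper: back up, then rename, the two definition files.
# $1 = VM folder, $2 = base name of the VM files
set_aside_vm_definitions() {
    dir=$1
    name=$2
    # Refuse to touch anything unless both definition files exist.
    for ext in vmdk vmx; do
        if [ ! -f "$dir/$name.$ext" ]; then
            echo "missing: $dir/$name.$ext" >&2
            return 1
        fi
    done
    for ext in vmdk vmx; do
        # Keep an untouched copy (.bak is an assumed suffix) ...
        cp -p "$dir/$name.$ext" "$dir/$name.$ext.bak" || return 1
        # ... then move the original out of the way; -i asks before overwriting.
        mv -i "$dir/$name.$ext" "$dir/$name.$ext.old" || return 1
    done
}
```

This never touches name-flat.vmdk, which holds the actual disk data.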
- Follow VMware KB article 1002511 to recreate the vmdk file. Note that in step 8, when you need to edit the vmdk file, it’s easiest to do this directly from WinSCP, which is why I suggest using both it and SSH simultaneously.
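As I understand the KB article, it has you create a temporary disk with `vmkfstools -c <size in bytes> -a <adapter type> -d thin temp.vmdk`, delete the new `temp-flat.vmdk`, rename `temp.vmdk` to `name.vmdk`, and then edit that descriptor so it points at the existing `name-flat.vmdk`. Check the article itself for the exact steps on your version. The descriptor edit is the fiddly part, so here is a hedged sketch of it as a shell function. `fix_vmdk_descriptor` is a hypothetical helper, `temp` is an assumed temporary name, and the last argument reflects the `ddb.thinProvisioned` value you noted earlier.

```shell
# Hypothetical helper: point a freshly created descriptor at the existing
# flat file, and drop the thin-provisioning flag unless the disk was thin.
# $1 = descriptor file       $2 = temporary base name (e.g. temp)
# $3 = real base name        $4 = "thin" to keep ddb.thinProvisioned
# Limitation: base names containing / & or \ would need escaping for sed.
fix_vmdk_descriptor() {
    desc=$1
    tmpname=$2
    name=$3
    mode=$4
    out="$desc.new"
    if [ "$mode" = "thin" ]; then
        sed -e "s/\"$tmpname-flat\.vmdk\"/\"$name-flat.vmdk\"/" \
            "$desc" > "$out"
    else
        sed -e "s/\"$tmpname-flat\.vmdk\"/\"$name-flat.vmdk\"/" \
            -e '/^ddb\.thinProvisioned/d' \
            "$desc" > "$out"
    fi || return 1
    # Replace the descriptor only after the edited copy was written.
    mv "$out" "$desc"
}
```

For a disk that was not thin provisioned, `fix_vmdk_descriptor name.vmdk temp name thick` would do the edit; compare the result against your backup copy before going any further.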
Now we need to deal with the vmx file. See this link. Here’s where things get even stranger. Following the linked post, you select the existing vmdk file with the datastore browser, so the file is clearly there. Yet when we finished the wizard and vCenter went to create the VM, we got an error that the vmdk file wasn’t found.
Well that makes no sense; the file is clearly there. It’s in the datastore browser and it’s visible through the CLI and WinSCP. It was at this point I noticed something strange. I realize it’s hard to tell with the obfuscation, but there’s a slash “/” missing between the folder name and the filename. Huh.
I went through the wizard again and paid more attention to what’s shown for the disk file path, after selecting it in the datastore browser:
To be clear, the brackets are just part of our naming convention. Notice, though, the total lack of slashes. I took a shot in the dark and added a slash between the folder and file name, then finished creating the VM and crossed my fingers.
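To know what the path ought to look like before you start typing slashes, it can help to build it from the path you already have open in SSH. This sketch assumes the usual vSphere notation of `[datastore] folder/file.vmdk` (separate from any brackets in your own names); `to_datastore_path` is a hypothetical helper.

```shell
# Hypothetical helper: turn /vmfs/volumes/<datastore>/<folder>/<file>
# into the "[<datastore>] <folder>/<file>" form used for datastore paths.
to_datastore_path() {
    rel=${1#/vmfs/volumes/}          # strip the fixed prefix
    if [ "$rel" = "$1" ]; then
        echo "not under /vmfs/volumes: $1" >&2
        return 1
    fi
    ds=${rel%%/*}                    # first component is the datastore
    rest=${rel#*/}                   # everything after it
    if [ "$rest" = "$rel" ] || [ -z "$rest" ]; then
        echo "no file path after datastore: $1" >&2
        return 1
    fi
    printf '[%s] %s\n' "$ds" "$rest"
}
```

If what the wizard shows differs from this output only by a missing slash, that matches what we ran into.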
The good news is that manually adding the slash fixed the “file not found” error. But why would things start behaving normally now? vCenter again showed two copies of the VM as discovered. Again, the second copy appeared operational and the first showed no info. Meanwhile, vCenter hung trying to create (presumably) the first copy of the VM. Eventually it will time out.
Note that the second (working) copy of the VM did show a MAC address conflict, which appears to be a harmless side effect of the inexplicable duplication of VMs. I was able to boot the second copy of the VM. You may need to reconfigure the NICs from within the guest OS, since in some cases new NICs may be ‘added’ to the system.
- Once this was all done, we had a fully functional guest VM that could be migrated from host to host without any issues.
Hopefully this problem doesn’t reoccur, but at least we have a viable fix for it if it does. I’d be curious to hear if anyone else knows what may cause this in the first place.