Background

As VMware has not yet released a fix for the SD card and USB drive issues, I’m still experiencing the problem of ESXi 7.0 U2a potentially killing USB and SD drives on hosts installed on USB or SD media. Since the previous workaround (copying VMware Tools to a RAMDISK with the ToolsRamdisk option) only worked for 8 days in my case, I needed something more “permanent” to keep the ESXi hosts “stable” (e.g. the host being able to enter maintenance mode, move VMs around, take snapshots/backups, run CLI commands, etc.).
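
For reference, the earlier RAMDISK workaround boils down to enabling one advanced option and rebooting. A minimal sketch, assuming the /UserVars/ToolsRamdisk option from VMware’s KB article on the subject:

# Enable the ToolsRamdisk workaround (VMware Tools served from a RAMDISK), then reboot
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
reboot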

Stopping the “stale IOs” against vmhba32 makes ESXi happy again

As I previously just yanked out the USB drive (which also works, by the way), I needed something more remote-friendly. As mentioned elsewhere, a combination of esxcfg-rescan -d vmhba32 and restarting the services/processes currently using the device (vmhba32) frees up the “stale/stuck IOs”, and ESXi is “happy again” (most things seem to work fine, as VMware ESXi runs fine from RAM).

That said, any “permanent” configuration changes to ESXi will not stick, as all IOs against the device that stores them fail. This includes trying to patch the host. The device is, in other words, marked as failed with APD/PDL (which I’m guessing is why the host is somewhat working again: no IOs attempted against the vmhba32 device equals no timeouts, and no timeouts equals working processes, etc. (wild guess)). Luckily, a quick reboot seems to make the drive work again, but only until the issue resurfaces (within hours or days, in my experience so far).
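
To check how the device is currently flagged, listing it directly should work; a sketch, using the device name from my hosts (mpx.vmhba32:C0:T0:L0), which you would adjust to match your own setup:

# Show the state of the USB/SD boot device; look at the Status line (e.g. on/dead)
esxcli storage core device list -d mpx.vmhba32:C0:T0:L0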

Checking the possible options for esxcfg-rescan

[root@esx-14:~] esxcfg-rescan -h
esxcfg-rescan <options> <adapter>
   -a|--add       Scan for only newly added devices.
   -d|--delete    Scan for only deleted devices.
   -A|--all       Scan all adapters.
   -u|--update    Scan existing paths only and update their state.
   -h|--help      Display this message.

Running esxcfg-rescan -d on the device that has issues

In my case, it’s vmhba32

[root@esx-14:~] esxcfg-rescan -d vmhba32
Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.

Check for any process (worlds) currently using the device

[root@esx-14:~] localcli storage core device world list|egrep -ie '(device|mpx)'
Device                  World ID  Open Count  World Name
mpx.vmhba32:C0:T0:L0    1051918           1   hostd
mpx.vmhba32:C0:T0:L0    1424916           1   localcli

Here we see that device mpx.vmhba32:C0:T0:L0 is being used by hostd (World ID 1051918). Tip: you may also just run localcli storage core device world list and check the full output; I simply added a filter on “device” and “mpx” to limit the output.

Restart hostd if needed (or any other process locking the device)

[root@esx-14:~] /etc/init.d/hostd restart
watchdog-hostd: Terminating watchdog process with PID 1051906 1051182
hostd stopped.
/usr/lib/vmware/hostd/bin/create-statsstore.py:30: DeprecationWarning: pyvsilib is replaced by vmware.vsi
  import pyvsilib as vsi
hostd started.
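
If the world list shows something other than hostd holding the device (vpxa, for example), restarting that service instead should release it. A hedged example, assuming vpxa is the culprit:

# Restart vpxa (the vCenter agent) if it is the world still using the device
/etc/init.d/vpxa restart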

Re-check if any process is still using the device

[root@esx-14:~] localcli storage core device world list|egrep -ie '(device|mpx)'
Device                                                                    World ID  Open Count  World Name

Note: In my case, I sometimes needed to wait another 2-3 minutes after restarting hostd before the world was actually stopped and the process no longer held vmhba32 (another timeout, I’m guessing).
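
If you would rather script that wait than re-run the check by hand, a simple polling loop like this sketch (same device name as above) should do:

# Poll every 15 seconds until no world is using the device anymore
while localcli storage core device world list | grep -q 'mpx.vmhba32:C0:T0:L0'; do
    sleep 15
done
echo "mpx.vmhba32:C0:T0:L0 is no longer in use"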

Results

Commands like df -h should now work again, and you can put the host into maintenance mode, complete vMotion and evacuate VMs as usual (which was stuck before), or do “CLI stuff”. Other procedures that might have failed before may start working again. So after vmhba32 is “flagged as failed”, you may (see the sketch after this list):

  • Enter maintenance mode, if needed.
  • Evacuate VMs/vMotion, etc., as usual.
  • Take snapshots of VMs.
  • Pre-checks (scripts) work.
  • Do CLI commands (which previously got stuck).
  • Reboot host.
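
For example, entering (and later exiting) maintenance mode straight from the ESXi shell should now work again; a sketch using esxcli (evacuating the VMs is still something you would drive from vCenter/vMotion):

# Enter maintenance mode from the ESXi shell
esxcli system maintenanceMode set --enable true
# ... do the work (patching, reboot, etc.) ...
# Exit maintenance mode again
esxcli system maintenanceMode set --enable false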

Also: once VMware releases a fix for this, I simply plan to reboot the host first (which makes the device work again) and then apply the patch. Hopefully it won’t be that long until a fix is released. For now, I’ll apply this “workaround” in my environment, which seems better than stale IOs against the ESXi install, with the repercussions (failing processes), possible multiple reboots needed, etc.
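
Roughly, that plan would look like the sketch below; the depot path and profile name are placeholders for whatever VMware ends up shipping:

# 1. Reboot first, so the USB/SD device works again
reboot
# 2. After the host is back up, apply the patch (placeholder depot path and profile name)
esxcli software profile update -d /vmfs/volumes/datastore1/ESXi-patch-depot.zip -p PATCHED-IMAGE-PROFILE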