Guest Post #
Info
This is a guest post by Espen Ødegaard, Senior Systems Consultant at Proact.
You can find him on Twitter and LinkedIn. Espen is usually found in vmkernel.log, esxtop, SexiGraf or vSAN Observer. Or eating; he eats a lot.
As VMware has not released a fix yet for the SD card/USB drive issues, I’m still hitting the problem described in ESXi 7.0 U2a Potentially Killing USB and SD drives on hosts that boot from USB or SD card. The previous workaround (copying VMware Tools to a RAM disk with the ToolsRamdisk option) only held for 8 days in my case, so I needed something more “permanent” to keep the ESXi hosts “stable” (i.e. able to enter maintenance mode, move VMs around, take snapshots/backups, run CLI commands, and so on).
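For reference, the earlier workaround is a one-liner. A sketch, assuming the advanced option lives under /UserVars/ToolsRamdisk as in VMware’s KB article; the host needs a reboot for it to take effect:
# Copy VMware Tools to a RAM disk instead of reading it from the boot device
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1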
Stopping the “stale IOs” against vmhba32 makes ESXi happy again #
As I had previously just yanked out the USB drive (which also works, by the way), I needed something more remote-friendly. As mentioned elsewhere, a combination of esxcfg-rescan -d vmhba32 and restarting the services/processes currently using the device (vmhba32) frees up the “stale/stuck IOs”, and ESXi is “happy again” (most things seem to work fine, since VMware ESXi runs fine from RAM).
That said, any “permanent” configuration changes to ESXi will not stick, since all IOs against the device that stores them fail; this includes trying to patch the host. In other words, the device is marked as failed with APD/PDL, which I’m guessing is why the host is somewhat working again: no IOs attempted against the vmhba32 device means no timeouts, and no timeouts means processes keep working (a wild guess). Luckily, a quick reboot seems to bring the drive back, but only until the issue resurfaces (after hours or days, in my experience so far).
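Before running the rescan, you can confirm the boot device really is in a failed state. A quick check, assuming the device is named mpx.vmhba32:C0:T0:L0 as on my hosts (the exact status string may vary by state):
# A failed boot device typically reports something other than "Status: on"
esxcli storage core device list -d mpx.vmhba32:C0:T0:L0 | grep -i status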
Checking the possible options for esxcfg-rescan #
[root@esx-14:~] esxcfg-rescan -h
esxcfg-rescan <options> <adapter>
-a|--add Scan for only newly added devices.
-d|--delete Scan for only deleted devices.
-A|--all Scan all adapters.
-u|--update Scan existing paths only and update their state.
-h|--help Display this message.
Running esxcfg-rescan -d on the device that has issues #
In my case, it’s vmhba32:
[root@esx-14:~] esxcfg-rescan -d vmhba32
Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.
Check for any process (worlds) currently using the device #
[root@esx-14:~] localcli storage core device world list|egrep -ie '(device|mpx)'
Device World ID Open Count World Name
mpx.vmhba32:C0:T0:L0 1051918 1 hostd
mpx.vmhba32:C0:T0:L0 1424916 1 localcli
Here we see that the device mpx.vmhba32:C0:T0:L0 is being used by hostd (world ID 1051918).
Tip: You may also just run localcli storage core device world list without the filter and read the full output. I simply added the egrep filter on device and mpx to limit the output.
Restart hostd, if needed (or any other process locking the device) #
[root@esx-14:~] /etc/init.d/hostd restart
watchdog-hostd: Terminating watchdog process with PID 1051906 1051182
hostd stopped.
/usr/lib/vmware/hostd/bin/create-statsstore.py:30: DeprecationWarning: pyvsilib is replaced by vmware.vsi
import pyvsilib as vsi
hostd started.
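hostd was the offender on my hosts, but if the world list points at another service, restart that one instead. For example, the vCenter agent (vpxa) follows the same init-script pattern:
# Restart vpxa if it is the world holding the device
/etc/init.d/vpxa restart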
Re-check if any process is still using the device #
[root@esx-14:~] localcli storage core device world list|egrep -ie '(device|mpx)'
Device World ID Open Count World Name
Note: After restarting the hostd process, in my case I sometimes needed to wait another 2-3 minutes before the world was actually stopped and the process no longer used vmhba32 (guessing another timeout).
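Instead of re-running the check by hand, a small loop can poll until the device is released. A minimal sketch (the 10-second interval is arbitrary):
# Wait until no VMkernel world holds the vmhba32 device anymore
while localcli storage core device world list | grep -q 'mpx.vmhba32'; do
    sleep 10
done
echo "vmhba32 released"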
Results #
Commands like df -h should now work, and you can put the host in maintenance mode, complete vMotion and evacuate VMs as usual (which was stuck before), or do “CLI stuff”. Other procedures that failed before may start working again.
So after vmhba32 is “flagged as failed”, you can:
- Enter maintenance mode, if needed.
- Evacuate VMs/vMotion, etc., as usual.
- Take snapshots of VMs.
- Pre-checks (scripts) work.
- Run CLI commands (which previously got stuck).
- Reboot host.
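Putting it all together: the sketch below is roughly what I run by hand, assuming vmhba32 is the affected adapter and hostd is the only world holding it (adjust for your environment):
#!/bin/sh
# Drop the dead paths on the failed boot adapter
esxcfg-rescan -d vmhba32

# If a world still holds the device, restart hostd and wait for release
if localcli storage core device world list | grep -q 'mpx.vmhba32'; then
    /etc/init.d/hostd restart
    while localcli storage core device world list | grep -q 'mpx.vmhba32'; do
        sleep 10
    done
fi

# Optional: enter maintenance mode once the host responds again
# esxcli system maintenanceMode set --enable true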
Also: once VMware releases a fix for this, I simply plan to reboot the host first (which makes the device work again) and then apply the patch. Hopefully it won’t be that long until a fix is released. For now, I’ll apply this “workaround” in my environment, which beats stale IOs against the ESXi host, the repercussions (failing processes), and the multiple reboots that might otherwise be needed.