ESXi 7.0 U2a Potentially Killing USB and SD drives!

by Espen Ødegaard · Read in about 8 min (1581 words)

Guest Post #

Info

This is a guest post by Espen Ødegaard, Senior Systems Consultant for Proact.

You can find him on Twitter and LinkedIn. Espen is usually found in vmkernel.log, esxtop, sexigraf or vSAN Observer. Or eating, he eats a lot.

Note

Workaround as of 1 June 2021

As VMware has not yet released a fix for the SD card and USB drive issues, I’m still seeing problems on ESXi 7.0 U2a hosts running from USB or SD card installs. The previous workaround (copying VMware Tools to RAMDISK with the ToolsRamdisk option) only held for 8 days in my case, so I needed something more “permanent” to keep the ESXi hosts “stable” (e.g. able to enter maintenance mode, move VMs around, take snapshots/backups, run CLI commands, etc.).

See ESXi 7.0 SD Card/USB Drive Issue Temporary Workaround for details.

After upgrading my 4-node vSAN cluster (homelab) to ESXi 7.0 U2a (build 17867351), I noticed that ESXi had issues talking to the USB device where ESXi was installed. I found a related KB from VMware outlining issues with the new VMFS-L, which became my baseline for troubleshooting: VMFS-L Locker partition corruption on SD cards in ESXi 7.0 (83376).

In short, it says that the VMFS-L partition may have become corrupt and that a re-install is needed. The KB also states that there was no resolution for the SD card corruption at the time it was published.

The workaround mentioned there, moving the scratch partition, is not applicable in my case, as I’ve already verified that my scratch partition is running from RAMDISK.

Verify scratch mountpoint

[root@esx-13:~] vmkfstools -Ph /scratch/
visorfs-1.00 (Raw Major Version: 0) file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 3.9 GB, 3.1 GB available, file block size 4 KB, max supported file size 0 bytes
Disk Block Size: 4096/4096/0
UUID: 00000000-00000000-0000-000000000000
Partitions spanned (on "notDCS"):
    memory
Is Native Snapshot Capable: NO
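
If you want to run the same check across several hosts, the output above is easy to test programmatically. The following is only a minimal sketch, assuming a Python 3 interpreter is available wherever you collect the output; the function name is_ramdisk_backed is made up for illustration, and the rule "visorfs spanning only memory means RAMDISK" is my reading of the output shown above, not something taken from VMware documentation.

```python
def is_ramdisk_backed(vmkfstools_output: str) -> bool:
    """Return True if `vmkfstools -Ph` output looks like a visorfs volume backed only by memory."""
    lines = [line.strip() for line in vmkfstools_output.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("visorfs"):
        return False
    for index, line in enumerate(lines):
        if line.startswith("Partitions spanned"):
            spanned = []
            # Collect the entries listed between "Partitions spanned" and the next field.
            for follow in lines[index + 1:]:
                if follow.startswith("Is Native Snapshot Capable"):
                    break
                spanned.append(follow)
            return spanned == ["memory"]
    return False


# Output copied from the /scratch check above.
SCRATCH_OUTPUT = """\
visorfs-1.00 (Raw Major Version: 0) file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 3.9 GB, 3.1 GB available, file block size 4 KB, max supported file size 0 bytes
Disk Block Size: 4096/4096/0
UUID: 00000000-00000000-0000-000000000000
Partitions spanned (on "notDCS"):
    memory
Is Native Snapshot Capable: NO
"""

print(is_ramdisk_backed(SCRATCH_OUTPUT))
```

Anything that is not visorfs, or that spans a real device instead of memory, makes the function return False.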

List content of the VMFS-L partition (LOCKER)

I also ran a quick find command (from another, working host) to list all contents of the mounted VMFS-L partition. Notice that the vmtoolsRepo packages are located here.

[root@esx-11:~] find /vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.fbb.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.fdc.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.pbc.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.sbc.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.vh.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.pb2.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.sdd.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/.jbc.sf
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/vibs
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/vibs/tools-light--2910230392612735297.xml
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/vibs/tools-light--2910230392612735297.xml.sig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/vibs/tools-light--2910230392612735297.xml.orig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/bulletins
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/profiles
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/baseimages
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/addons
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/solutions
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/manifests
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/reservedComponents
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/var/db/locker/reservedVibs
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/floppies
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/floppies/pvscsi-Windows2008.flp
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/floppies/pvscsi-Windows8.flp
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/floppies/pvscsi-WindowsVista.flp
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/isoimages_manifest.txt
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/isoimages_manifest.txt.sig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/linux.iso
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/linux.iso.sig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/linux_avr_manifest.txt
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/linux.iso.sha
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/linux_avr_manifest.txt.sig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/windows.iso
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/windows.iso.sha
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/windows.iso.sig
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/windows_avr_manifest.txt
/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo/vmtools/windows_avr_manifest.txt.sig

Getting a host with issues in maintenance mode — physically remove the USB device first #

Getting the ESXi hosts with USB issues into maintenance mode was also a little tricky. Used to doing things “remotely”, I wanted to evacuate the VMs the usual way (just enter maintenance mode and let DRS handle the rest), but this was a no-go. After entering maintenance mode, the vMotion jobs for the VMs would start, but nothing actually happened. All VMs “started” the Migrating/vMotion job (status 9% or 12% in vCenter), but checking the host with esxtop, under network, I found that no traffic was occurring on the vMotion interface, which is usually running at full pipe during a vMotion.

Re-checking the logs, the issues with USB repeated, again and again. I thought I’d try to physically remove the USB device from the host, as this would trigger a “proper” All Paths Down (APD) on the USB device.

So I physically removed the USB device. Waited 2-3 minutes, and boom - the vMotion process finished at once. Digging into the logs (again, /var/log/vmkernel.log has the answers), I could verify the APD event.

2021-05-15T14:00:03.326Z cpu7:1048720)StorageApdHandler: 606: APD timeout event for 0x43040c4c34d0 [mpx.vmhba32:C0:T0:L0]
2021-05-15T14:00:03.326Z cpu7:1048720)StorageApdHandlerEv: 126: Device or filesystem with identifier [mpx.vmhba32:C0:T0:L0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast faile$
2021-05-15T14:00:03.326Z cpu3:1048731)ScsiDeviceIO: 4277: Cmd(0x4578c1283080) 0x1a, CmdSN 0x93be0 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0
2021-05-15T14:00:03.326Z cpu3:1048731)WARNING: NMP: nmp_DeviceStartLoop:740: NMP Device "mpx.vmhba32:C0:T0:L0" is blocked. Not starting I/O from device.
2021-05-15T14:00:03.326Z cpu3:1055182)LVM: 6817: Forcing APD unregistration of devID 6092ba2b-13467d16-8d9c-000c292b45b0 in state 1.
2021-05-15T14:00:03.326Z cpu3:1055182)LVM: 6192: Could not open device mpx.vmhba32:C0:T0:L0:7, vol [6092ba2a-e004ead6-09c5-000c292b45b0, 6092ba2a-e004ead6-09c5-000c292b45b0, 1]: No connection
2021-05-15T14:00:03.326Z cpu3:1055182)Vol3: 2129: Could not open device 'mpx.vmhba32:C0:T0:L0:7' for volume open: Not found
2021-05-15T14:00:03.326Z cpu3:1055182)Vol3: 4339: Failed to get object 28 type 1 uuid 6092ba2b-1fdb3f52-337c-000c292b45b0 FD 0 gen 0 :Not found
2021-05-15T14:00:03.326Z cpu3:1055182)WARNING: Fil3: 1534: Failed to reserve volume f533 28 1 6092ba2b 1fdb3f52 c00337c b0452b29 0 0 0 0 0 0 0
2021-05-15T14:00:03.326Z cpu3:1055182)Vol3: 4339: Failed to get object 28 type 2 uuid 6092ba2b-1fdb3f52-337c-000c292b45b0 FD 4 gen 1 :Not found
2021-05-15T14:00:03.326Z cpu4:2205969)VFAT: 5144: Failed to get object 36 type 2 uuid 4365f3f4-494e65bd-7b92-e7c78fac244e cnum 0 dindex fffffffecdate 0 ctime 0 MS 0 :No connection
2021-05-15T14:00:03.326Z cpu3:1051988)LVM: 6817: Forcing APD unregistration of devID 6092ba2b-13467d16-8d9c-000c292b45b0 in state 1.
2021-05-15T14:00:03.326Z cpu3:1051988)LVM: 6817: Forcing APD unregistration of devID 6092ba2b-13467d16-8d9c-000c292b45b0 in state 1.
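
To pick the APD timeout event out of a busy vmkernel.log, a small parser can help. This is a rough sketch that only relies on the wording of the StorageApdHandlerEv line shown above; the names ApdTimeout and parse_apd_timeout are made up for illustration, and other ESXi builds may word the message differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern based solely on the StorageApdHandlerEv line in the log excerpt above.
APD_TIMEOUT = re.compile(
    r"^(?P<timestamp>\S+) .*identifier \[(?P<device>[^\]]+)\] has entered the "
    r"All Paths Down Timeout state after being in the All Paths Down state "
    r"for (?P<seconds>\d+) seconds"
)


class ApdTimeout(NamedTuple):
    timestamp: str
    device: str
    seconds: int


def parse_apd_timeout(line: str) -> Optional[ApdTimeout]:
    """Return the APD timeout details from a log line, or None if the line is something else."""
    match = APD_TIMEOUT.search(line)
    if match is None:
        return None
    return ApdTimeout(match["timestamp"], match["device"], int(match["seconds"]))


# Line copied from the excerpt above (it is truncated in the original output).
APD_LINE = (
    "2021-05-15T14:00:03.326Z cpu7:1048720)StorageApdHandlerEv: 126: "
    "Device or filesystem with identifier [mpx.vmhba32:C0:T0:L0] has entered "
    "the All Paths Down Timeout state after being in the All Paths Down state "
    "for 140 seconds. I/Os will now be fast faile$"
)

print(parse_apd_timeout(APD_LINE))
```

The 140 seconds reported in the event fits the 2-3 minutes I waited after pulling the device.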

So I got both hosts in maintenance mode, and rebooted. Everything was working again.

New findings, from an old value #

Continuing my research, I stumbled upon a new thread in the Dell Communities (VMware 7.0 U2 losing contact with SD card), where VMware Support sent a workaround from an older KB about moving vmtoolsRepo to RAMDISK: High frequency of read operations on VMware Tools image may cause SD card corruption (2149257).

In ESXi 6.0 Update 3 and later, changes were made to reduce the number of read operations being sent to the SD card, an advanced parameter was introduced that allows you to migrate your VMware tools image to ramdisk on boot. This way, the information is read only once from the SD card per boot cycle.

Note: Even though KB2149257 currently only targets ESXi 6.0 and 6.5 (it doesn’t mention ESXi 7.0 at all, as of the time of writing), I’m guessing the same workaround may now apply to ESXi 7.0 U1+, especially if the old “throttle” (the fix in 6.0 U3) has been removed as part of the continued work on the new VMFS-L.

Applying the workaround — adding option ToolsRamdisk #

As described in KB2149257, I added the ToolsRamdisk option on all hosts running ESXi 7.0 U2a (build 17867351).

Steps

  • Creating the option first
esxcfg-advcfg -A ToolsRamdisk --add-desc "Use VMware Tools repository from /tools ramdisk" --add-default "0"  --add-type 'int' --add-min "0" --add-max "1"
  • Setting the value to 1
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
  • Verifying the value is set
esxcli system settings advanced list -o /UserVars/ToolsRamdisk
  • Reboot the host (as setting applies at boot)
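
With more than a couple of hosts, the steps above are worth scripting. Below is a minimal sketch that only assembles the three commands listed above and prints them; it defaults to a dry run and does not reboot anything. The function names are made up for illustration, and it assumes a Python 3 interpreter is available where you run it.

```python
import shlex
import subprocess

OPTION = "ToolsRamdisk"


def tools_ramdisk_commands():
    """Return the create, set and verify commands from the steps above as argument lists."""
    return [
        ["esxcfg-advcfg", "-A", OPTION,
         "--add-desc", "Use VMware Tools repository from /tools ramdisk",
         "--add-default", "0", "--add-type", "int",
         "--add-min", "0", "--add-max", "1"],
        ["esxcli", "system", "settings", "advanced", "set",
         "-o", f"/UserVars/{OPTION}", "-i", "1"],
        ["esxcli", "system", "settings", "advanced", "list",
         "-o", f"/UserVars/{OPTION}"],
    ]


def apply(dry_run=True):
    """Print each command; run it only when dry_run is False (the reboot is left to you)."""
    for argv in tools_ramdisk_commands():
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            subprocess.run(argv, check=True)


apply(dry_run=True)
```

Calling apply(dry_run=False) on a host would execute the same three commands in order and stop at the first one that fails.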

Verify new tools mountpoint running from RAMDISK #

After a reboot, I found the newly created mountpoint located under /tools. Checking the location with vmkfstools -Ph, we can see that it’s mounted in a RAMDISK.

Checking mountpoint with ls

[root@esx-11:~] ls -hal /tools/
total 16
drwxrwxrwt    1 root     root         512 May 18 14:56 .
drwxr-xr-x    1 root     root         512 May 18 18:18 ..
drwxr-xr-x    1 root     root         512 May 18 14:56 floppies
drwxr-xr-x    1 root     root         512 May 18 14:56 vmtools

Getting mountpoint location with vmkfstools -Ph

[root@esx-11:~] vmkfstools -Ph /tools/
visorfs-1.00 (Raw Major Version: 0) file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 4.2 GB, 3.2 GB available, file block size 4 KB, max supported file size 0 bytes
Disk Block Size: 4096/4096/0
UUID: 00000000-00000000-0000-000000000000
Partitions spanned (on "notDCS"):
	memory
Is Native Snapshot Capable: NO

Checking vmkernel.log for boot events containing the word “tools”

# Check vmkernel.log for tools-related hits
[root@esx-11:~] cat /var/log/vmkernel.log|grep -i tools
2021-05-18T14:55:44.765Z cpu7:1048823)SchedVsi: 2098: Group: host/vim/vimuser/vmtoolsd(1725): min=46 max=46 minLimit=46, units: mb
2021-05-18T14:56:02.361Z cpu2:1048852)Activating Jumpstart plugin vmtoolsRepo.
2021-05-18T14:56:02.399Z cpu3:1049894)VisorFSRam: 871: tools with (0,286,0,256,1777)
2021-05-18T14:56:02.399Z cpu3:1049894)FSS: 8565: Mounting fs visorfs (430547881820) with -o 0,286,0,256,0,01777,tools on file descriptor 43054e9b9230
2021-05-18T14:56:15.302Z cpu3:1048852)Jumpstart plugin vmtoolsRepo activated.
2021-05-18T14:56:21.821Z cpu6:1050194)Starting service vmtoolsd
2021-05-18T14:56:21.830Z cpu6:1050194)Activating Jumpstart plugin vmtoolsd.
2021-05-18T14:56:21.852Z cpu4:1050194)Jumpstart plugin vmtoolsd activated.
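
The gap between the “Activating” and “activated” lines gives a rough idea of how long each Jumpstart plugin took at boot. Here is a small sketch that computes it from the lines above; the function name plugin_durations is made up for illustration, and reading the vmtoolsRepo duration as “time spent copying the tools to RAMDISK” is my assumption, not something the log states.

```python
import re
from datetime import datetime

START = re.compile(r"^(\S+) .*Activating Jumpstart plugin (\S+)\.$")
DONE = re.compile(r"^(\S+) .*Jumpstart plugin (\S+) activated\.$")


def parse_ts(text):
    """Parse the UTC timestamp format used in vmkernel.log."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")


def plugin_durations(lines):
    """Map plugin name to seconds between its 'Activating' and 'activated' lines."""
    started, durations = {}, {}
    for line in lines:
        line = line.strip()
        begin = START.match(line)
        if begin:
            started[begin.group(2)] = parse_ts(begin.group(1))
            continue
        end = DONE.match(line)
        if end and end.group(2) in started:
            delta = parse_ts(end.group(1)) - started[end.group(2)]
            durations[end.group(2)] = delta.total_seconds()
    return durations


# Lines copied from the grep output above.
BOOT_LINES = [
    "2021-05-18T14:56:02.361Z cpu2:1048852)Activating Jumpstart plugin vmtoolsRepo.",
    "2021-05-18T14:56:15.302Z cpu3:1048852)Jumpstart plugin vmtoolsRepo activated.",
    "2021-05-18T14:56:21.830Z cpu6:1050194)Activating Jumpstart plugin vmtoolsd.",
    "2021-05-18T14:56:21.852Z cpu4:1050194)Jumpstart plugin vmtoolsd activated.",
]

for name, seconds in plugin_durations(BOOT_LINES).items():
    print(f"{name}: {seconds:.3f} s")
```

On my host the vmtoolsRepo plugin took roughly 13 seconds, while vmtoolsd was activated almost instantly.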

Listing content of the /tools directory

[root@esx-11:~] find /tools/
/tools/
/tools/floppies
/tools/floppies/pvscsi-WindowsVista.flp
/tools/floppies/pvscsi-Windows2008.flp
/tools/floppies/pvscsi-Windows8.flp
/tools/vmtools
/tools/vmtools/windows.iso.sig
/tools/vmtools/linux.iso.sha
/tools/vmtools/linux_avr_manifest.txt.sig
/tools/vmtools/isoimages_manifest.txt.sig
/tools/vmtools/linux.iso
/tools/vmtools/linux_avr_manifest.txt
/tools/vmtools/isoimages_manifest.txt
/tools/vmtools/windows.iso
/tools/vmtools/windows_avr_manifest.txt.sig
/tools/vmtools/windows_avr_manifest.txt
/tools/vmtools/windows.iso.sha
/tools/vmtools/linux.iso.sig
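
Comparing this listing with the vmtoolsRepo listing from the LOCKER partition earlier, the two appear to hold the same files. A quick way to confirm that without eyeballing it is to strip the roots and diff the two sets. This is only a sketch; the function names are made up for illustration, and the sample lists below are a shortened subset of the find output shown in this post.

```python
def relative_paths(paths, root):
    """Return the set of paths below root, with the root prefix removed."""
    root = root.rstrip("/")
    result = set()
    for path in paths:
        path = path.strip().rstrip("/")
        if path.startswith(root + "/"):
            result.add(path[len(root) + 1:])
    return result


def compare_listings(locker_paths, locker_root, tools_paths, tools_root="/tools"):
    """Return (missing from RAMDISK, only in RAMDISK) as two sorted lists."""
    on_locker = relative_paths(locker_paths, locker_root)
    in_ramdisk = relative_paths(tools_paths, tools_root)
    return sorted(on_locker - in_ramdisk), sorted(in_ramdisk - on_locker)


LOCKER_ROOT = "/vmfs/volumes/LOCKER-6092ba2b-1fdb3f52-337c-000c292b45b0/packages/vmtoolsRepo"

# Shortened subset of the two find listings from this post.
LOCKER_SAMPLE = [
    LOCKER_ROOT,
    LOCKER_ROOT + "/floppies/pvscsi-Windows8.flp",
    LOCKER_ROOT + "/vmtools/linux.iso",
    LOCKER_ROOT + "/vmtools/windows.iso",
]
TOOLS_SAMPLE = [
    "/tools/",
    "/tools/floppies/pvscsi-Windows8.flp",
    "/tools/vmtools/windows.iso",
    "/tools/vmtools/linux.iso",
]

missing, extra = compare_listings(LOCKER_SAMPLE, LOCKER_ROOT, TOOLS_SAMPLE)
print("missing from /tools:", missing)
print("only in /tools:", extra)
```

Two empty lists mean every file under vmtoolsRepo also exists under /tools, which is what I would expect after the Jumpstart plugin has run.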

So yeah, there you have it. Perhaps using the standard profile (which includes VMware Tools, unlike the “no-tools” profile) on USB was a bad idea. I usually use the “no-tools” profile for USB installs, but I recently switched my USB devices to SanDisk Ultra Fit SDCZ430-032G-G46 devices, which I thought were way better and more stable.

Bonus: Tips on proactively detecting issues on existing USB and SD card installs #

Tips: The following may apply if there are issues with the USB or SD card in your environment

  • Running command df -h from CLI will get stuck, or fail, for the LOCKER mount (VMFS-L partition)
  • Checking the host’s log file /var/log/vmkernel.log, you’ll notice entries similar to this
2021-05-15T13:48:27.674Z cpu6:1048743)ScsiDeviceIO: 4315: Cmd(0x4578c12ad880) 0x1a, cmdId.initiator=0x45389cb1a6f8 CmdSN 0x93a68 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
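
Since a hanging df -h is one of the symptoms, it can be probed with a timeout instead of waiting on a stuck shell. The following is a minimal sketch, assuming a Python 3 interpreter is available on the machine running the check; the function name probe and the 10 second timeout are arbitrary choices of mine. On timeout it deliberately does not wait for the child to exit, because a process stuck on storage I/O may never go away.

```python
import subprocess


def probe(cmd, timeout_s):
    """Run cmd and return 'ok', 'failed' (non-zero exit) or 'hung' (no exit within timeout_s)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        return_code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait again, the child may be blocked on I/O.
        proc.kill()
        return "hung"
    return "ok" if return_code == 0 else "failed"


# Example call on a host (not executed here):
# status = probe(["df", "-h"], timeout_s=10.0)
```

Either “hung” or “failed” would be a reason to go and look at vmkernel.log on that host.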

I suggest setting up a vRLI alert on an exact match of “Cancelled from path layer. Cmd count Active”, which I have only found on faulty hosts so far. I’ve actually set up a webhook alert, so if any USB issue arises, I immediately get notified in my Slack channel and can react early.
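
If you don’t have vRLI, the same match can be approximated with a few lines of Python run against a copy of vmkernel.log. This is only a sketch of the idea, not my actual vRLI/webhook setup: the function names are made up for illustration, the message text is assembled by me, and the payload assumes a Slack-style incoming webhook that accepts a JSON body with a text field. Nothing is posted anywhere here.

```python
import json
import re

MARKER = "Cancelled from path layer. Cmd count Active"
DEVICE = re.compile(r'to dev "([^"]+)"')


def find_usb_failures(lines):
    """Return timestamp and device for every log line containing the marker string."""
    hits = []
    for line in lines:
        if MARKER in line:
            device = DEVICE.search(line)
            hits.append({
                "timestamp": line.split(" ", 1)[0],
                "device": device.group(1) if device else "unknown",
            })
    return hits


def slack_payload(hostname, hits):
    """Build a JSON body for a webhook call; the message format is illustrative only."""
    devices = sorted({hit["device"] for hit in hits})
    text = f"{hostname}: {len(hits)} '{MARKER}' event(s) on {', '.join(devices)}"
    return json.dumps({"text": text})


# Line copied from the vmkernel.log example above.
SAMPLE = (
    '2021-05-15T13:48:27.674Z cpu6:1048743)ScsiDeviceIO: 4315: Cmd(0x4578c12ad880) 0x1a, '
    'cmdId.initiator=0x45389cb1a6f8 CmdSN 0x93a68 from world 0 to dev "mpx.vmhba32:C0:T0:L0" '
    'failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1'
)

hits = find_usb_failures([SAMPLE])
if hits:
    print(slack_payload("esx-13", hits))
```

The resulting JSON string can then be sent to whatever webhook endpoint you use for alerting.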

In summary #

  • Installing ESXi using the no-tools image (e.g. ESXi-70U2a-17867351-no-tools) is probably better suited for USB/SD card installs, and may not require the option/workaround provided above.
  • The user setting /UserVars/ToolsRamdisk, outlined in KB2149257, loads vmtools to RAMDISK at boot (mounted under /tools), possibly preventing USB drives and SD cards from burning out (well, time will tell).
  • A funeral may be needed for my USB devices.

Post last updated on September 9, 2021: Fix some frontmatter issues

About the author

Christian Mohn works as a Chief Technologist SDDC for Proact in Norway.

See his About page for more details, or find him on Twitter.
