I recently had a case of “go with your gut” when we added some new NVMe disks to an existing VMware vSAN solution at a customer.
Normally I’m very cautious and put hosts into maintenance mode no matter how small the hardware change is, but this time, against my better judgement, I decided to hot add the disks (which is, of course, supported). However, I fumbled: I inserted one disk, quickly pulled it out, and inserted it again, and ended up with a dreaded Purple Screen of Death (PSOD) on the host.
Naturally this freaked me out and I was eager to figure out what the problem was. Searching through the KBs at VMware didn’t give me any clues, but a quick Google search took me to the ESXi 7.0u2c Release Notes:
“PR 2708326: If an NVMe device is hot added and hot removed in a short interval, the ESXi host might fail with a purple diagnostic screen.
If an NVMe device is hot added and hot removed in a short interval, the NVMe driver might fail to initialize the NVMe controller due to a command timeout. As a result, the driver might access memory that is already freed in a cleanup process. In the backtrace, you see a message such as WARNING: NVMEDEV: NVMEInitializeController:4045: Failed to get controller identify data, status: Timeout. Eventually, the ESXi host might fail with a purple diagnostic screen with an error similar to #PF Exception … in world …:vmkdevmgr. This issue is resolved in this release.”
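If you want to check whether a host has logged the warning from that release note, a small script can scan a log for it. This is only a rough sketch: the function name find_nvme_init_timeouts is my own, and the number after NVMEInitializeController (4045 in the release note) is matched loosely because I'm assuming it can differ between driver builds.

```python
import re

# Signature taken from the release-note text quoted above. The number after
# the function name (4045 in the note) is matched loosely, on the assumption
# that it may vary between driver builds.
NVME_TIMEOUT = re.compile(
    r"NVMEDEV: NVMEInitializeController:\d+: "
    r"Failed to get controller identify data, status: Timeout"
)


def find_nvme_init_timeouts(lines):
    """Return the lines that contain the NVMe controller-init timeout warning.

    `lines` is any iterable of strings, for example an open log file.
    (Hypothetical helper, not part of any VMware tooling.)
    """
    return [line.rstrip("\n") for line in lines if NVME_TIMEOUT.search(line)]
```

To use it, open a copy of the host's vmkernel log (which log file holds the message on your build is something to verify) and pass the file object to find_nvme_init_timeouts; any hits are returned as a list of matching lines.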
Luckily, there were no more errors after hot adding the disks and rebooting the host, so the next step is of course patching the hosts to a release that includes this fix.
I did not experience the same issue on any of the other hosts in that cluster, probably due to steadier hands or less caffeine in my bloodstream.
