
Hot Add NVMe Device Caused PSOD on ESXi

by Stine Elise Larsen · Read in about 2 min (380 words)

Guest Post

Info

This is a guest post by Stine Elise Larsen, Senior Datacenter Consultant for Proact.

You can find her on Twitter and LinkedIn. In a work environment she is normally overly cautious, perhaps even to the point where she tries to make herself virtually invisible. In the real world she plays arcade dance games and attends rock concerts.

For a list of all of Stine’s posts, see Guest Authors.

I recently had a case of “go with your gut” when we added some new NVMe disks to an existing VMware vSAN solution at a customer site.

Normally I’m very cautious and will put hosts into maintenance mode, no matter how small the hardware change, but against my better judgement I decided this time to hot add the disks (which, of course, is supported). However, I fumbled: I inserted a disk, quickly removed it, and then inserted it again, and ended up with the dreaded Purple Screen of Death (PSOD) on the host.
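For reference, here is a minimal pyVmomi sketch of the cautious route: putting a vSAN host into maintenance mode before touching the hardware. The vCenter address, credentials, and host name are illustrative placeholders, not anything from the environment described here.

    import ssl

    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    # Lab-only shortcut: skip certificate verification.
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    content = si.RetrieveContent()

    # Look up the host by name (placeholder name).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esxi01.example.com")

    # vSAN-aware maintenance mode: "ensureObjectAccessibility" keeps vSAN
    # objects reachable without a full data evacuation.
    spec = vim.host.MaintenanceSpec(
        vsanMode=vim.vsan.host.DecommissionMode(
            objectAction="ensureObjectAccessibility"))
    WaitForTask(host.EnterMaintenanceMode_Task(timeout=0, maintenanceSpec=spec))

    Disconnect(si)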

Naturally, this freaked me out and I was eager to figure out what the problem was. Searching through VMware’s KBs didn’t give me any clues, but a quick Google search took me to the ESXi 7.0 Update 2c release notes:

“PR 2708326: If an NVMe device is hot added and hot removed in a short interval, the ESXi host might fail with a purple diagnostic screen. If an NVMe device is hot added and hot removed in a short interval, the NVMe driver might fail to initialize the NVMe controller due to a command timeout. As a result, the driver might access memory that is already freed in a cleanup process. In the backtrace, you see a message such as WARNING: NVMEDEV: NVMEInitializeController:4045: Failed to get controller identify data, status: Timeout.

Eventually, the ESXi host might fail with a purple diagnostic screen with an error similar to #PF Exception … in world …:vmkdevmgr. This issue is resolved in this release.”
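If you want to check whether the hosts in a cluster are already on a build that carries the fix, a small pyVmomi sketch like the one below can list each host’s version and build. The build number for 7.0 Update 2c is an assumption on my part; verify it against the release notes before acting on the output.

    import ssl

    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    FIXED_BUILD = 18426014  # assumed build number for ESXi 7.0 U2c -- verify!

    ctx = ssl._create_unverified_context()  # lab only
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for h in view.view:
        info = h.summary.config.product  # vim.AboutInfo: fullName, version, build
        status = "has the fix" if int(info.build) >= FIXED_BUILD else "needs patching"
        print(f"{h.name}: {info.fullName} -> {status}")

    Disconnect(si)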

Luckily, there were no more errors after hot adding the disks and rebooting the host, so the next step is of course some patching.
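One way to verify is to scan a copy of the host’s vmkernel.log for the warning signature quoted in the release note. A minimal sketch, assuming the log has been copied off the host (on ESXi the live log is /var/log/vmkernel.log):

    # Search an offline copy of vmkernel.log for the NVMe init warning.
    SIGNATURE = "NVMEInitializeController"

    with open("vmkernel.log", encoding="utf-8", errors="replace") as log:
        hits = [line.rstrip() for line in log if SIGNATURE in line]

    for line in hits:
        print(line)
    print(f"{len(hits)} matching line(s) found")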

I did not experience the same issue on any of the other hosts in that cluster, probably due to steadier hands or less caffeine in my bloodstream.


This is a post in the Guest Post series.


Post last updated on December 22, 2022: Fix Guest post link

About the author


Christian Mohn works as a Chief Technologist SDDC for Proact in Norway.

See his About page for more details, or find him on Twitter.
