Alerting on “Bootbank cannot be found at path ‘/bootbank’” in vRealize Operations

If you boot your ESXi hosts from SD cards or USB devices, you might have run into this issue: suddenly your host(s) display the following under Events:
“Bootbank cannot be found at path ‘/bootbank’.”

Usually this means that the boot device has become corrupted or inaccessible, either because of a device failure or some other issue. The host normally continues to run, until it is rebooted, that is…

For some reason, vRealize Operations (vROps) doesn’t treat this as a host issue that it alerts on, so if your alerting regime is based on vROps alerts, you might not get alerted immediately. Thankfully, there is a way to remedy this by getting vROps and vRealize Log Insight (vRLI) to work together.

In order for this to work, you first need to configure the vRealize Log Insight integration with vRealize Operations.

Log in to Log Insight and search for “Bootbank cannot be found at path ‘/bootbank’.” If you want to restrict the results even further, use two filters: one for vc_event_type exists, and one for the text search itself.
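To make the meaning of the two filters concrete, here is a minimal Python sketch that applies the same logic to a list of log events. The event layout (plain dicts with a `text` key and an optional `vc_event_type` key) and the helper names are hypothetical and chosen for illustration only; this is not the Log Insight API, just the "field exists AND text matches" logic the query expresses.

```python
from typing import Iterable, Iterator

# Message text the Log Insight query searches for (plain ASCII quotes here;
# adjust to whatever appears in your own logs).
BOOTBANK_TEXT = "Bootbank cannot be found at path '/bootbank'"


def matches_bootbank_alert(event: dict) -> bool:
    """Hypothetical helper mirroring the two filters:
    1. the vc_event_type field exists on the event, and
    2. the event text contains the bootbank message."""
    has_event_type = "vc_event_type" in event
    return has_event_type and BOOTBANK_TEXT in event.get("text", "")


def filter_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only the events that satisfy both filters."""
    return (e for e in events if matches_bootbank_alert(e))


if __name__ == "__main__":
    # Made-up sample events; "example-event-type" is a placeholder value.
    sample = [
        {"text": "Bootbank cannot be found at path '/bootbank'.", "vc_event_type": "example-event-type"},
        {"text": "Bootbank cannot be found at path '/bootbank'."},  # no vc_event_type field
        {"text": "Some unrelated message", "vc_event_type": "example-event-type"},
    ]
    for match in filter_events(sample):
        print(match)
```

Only the first sample event passes: the second lacks the `vc_event_type` field, and the third has the field but not the bootbank text.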

Click on the little red bell icon and select “Create Alert from Query”.

This will bring up the “Edit Alert” window, where you can define the details of the alert.

Write a proper Description and Recommendation for the alert, and enable “Send to vRealize Operations Manager”. You also need to specify a Fallback Option, which is the object Log Insight should attach the alert to if the originating object isn’t found in vRealize Operations.

And that’s really it. As long as the vRLI and vROps integration is configured and working, it’s easy to add your own custom alerts in vRLI and have them pop up in vROps.

If you want to copy my Description and Recommendation, here they are:

Description:

“The device containing the VMware ESXi bootbank cannot be found. This may be because of a boot device failure. Specific details should be available in the symptom details. For more information, check the Tasks & Events pane for the host in the vSphere Web Client.”

Recommendation:

“Change or replace the boot device, if necessary. Contact the hardware vendor for assistance. After the problem is resolved, the alert will be canceled when the sensor that reported the problem indicates that the problem no longer exists.”

 

VMware vSphere 6.5 PSOD: GP Exception 13

While I was at a customer site migrating an old vSphere 5.5 environment to 6.5, several hosts suddenly crashed with a PSOD (Purple Screen of Death) during the migration. Long story short, we got hit by VMware KB 2147958: “ESXi 6.5 host fails with PSOD: GP Exception 13 in multiple VMM world at VmAnon_AllocVmmPages (2147958)”.

It turned out that a bunch of the VMs we were vMotioning from the old environment had the cpuid.coresPerSocket advanced setting in their .vmx files. This can cause ESXi 6.5 to panic, and in our case it certainly did.
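If you want to find affected VMs before migrating them, one option is to scan their .vmx files for the setting. Below is a minimal Python sketch, assuming you have copies of the .vmx files available on a local or mounted path (how you obtain them is up to you). It uses the standard VMX spelling `cpuid.coresPerSocket` and compares keys case-insensitively; `find_setting` and `scan_directory` are hypothetical helper names, not part of any VMware tool.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Advanced setting to look for (standard VMX spelling; adjust if needed).
KEY = "cpuid.coresPerSocket"

# A .vmx line typically looks like: key = "value"
# Lines starting with '#' are comments and never match this pattern.
VMX_LINE = re.compile(r'^\s*([^=#\s]+)\s*=\s*"?(.*?)"?\s*$')


def find_setting(vmx_text: str, key: str = KEY) -> Optional[str]:
    """Return the value of `key` in the given .vmx contents, or None."""
    for line in vmx_text.splitlines():
        m = VMX_LINE.match(line)
        if m and m.group(1).lower() == key.lower():
            return m.group(2)
    return None


def scan_directory(root: Path, key: str = KEY) -> Dict[Path, str]:
    """Map every .vmx file under `root` that sets `key` to its value."""
    hits: Dict[Path, str] = {}
    for vmx in root.rglob("*.vmx"):
        value = find_setting(vmx.read_text(errors="replace"), key)
        if value is not None:
            hits[vmx] = value
    return hits


if __name__ == "__main__":
    # Hypothetical location of downloaded .vmx files.
    for path, value in scan_directory(Path("./vmx-files")).items():
        print(f"{path}: {KEY} = {value}")
```

The regex keeps the parsing deliberately simple; it handles the common `key = "value"` form and skips comment lines, which is enough for a pre-migration inventory.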

Upgrading the hosts to 6.5a, as the knowledge base article recommends, resolved the issue, and we did not experience any more PSODs while migrating the 100+ VMs from the old environment to the new one.

ESXi Snapshot Problems: msg.snapshot.error-QUIESCINGERROR


Just a quick post about something I experienced today at a client running ESXi 6.0 hosts:

If you have trouble taking VMware snapshots and see a msg.snapshot.error-QUIESCINGERROR error, check the host’s time settings and NTP configuration.

In this case, snapshots of VMs on the other hosts in the cluster worked fine, but once a VM was moved to the newly added host, snapshot operations started failing after an hour or so.

It turned out that the new host had not been properly set up to use NTP, and time drift between the host and vCenter caused the snapshot failures. Correcting the time on the host and configuring NTP resolved the issue.
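For background on how drift like this is measured: NTP estimates a clock’s offset from a time server using four timestamps from a single request/response exchange. The Python sketch below implements the standard offset and round-trip delay calculation from the NTP specification (RFC 5905). The timestamps in the example are made up, and `MAX_DRIFT_SECONDS` is a hypothetical threshold for illustration, not a VMware-documented limit.

```python
from typing import Tuple

# Hypothetical alerting threshold, for illustration only.
MAX_DRIFT_SECONDS = 1.0


def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Compute clock offset and round-trip delay (seconds).

    t1: client sends request (client clock)
    t2: server receives request (server clock)
    t3: server sends reply (server clock)
    t4: client receives reply (client clock)

    A positive offset means the client clock is behind the server.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def clock_is_drifting(offset: float, limit: float = MAX_DRIFT_SECONDS) -> bool:
    """True if the absolute offset exceeds the (hypothetical) limit."""
    return abs(offset) > limit


if __name__ == "__main__":
    # Made-up exchange: client clock 30 s behind the server,
    # 50 ms each way on the network, 10 ms processing on the server.
    offset, delay = ntp_offset_and_delay(100.0, 130.05, 130.06, 100.11)
    print(f"offset={offset:.3f}s delay={delay:.3f}s drifting={clock_is_drifting(offset)}")
```

Averaging the two one-way differences cancels out the network latency as long as it is roughly symmetric, which is why NTP can estimate the offset without knowing the one-way delays.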

Always remember: If the problem isn’t DNS, it almost certainly is NTP.