
Expired VMware vCenter certificates

by Stine Elise Larsen · Read in about 8 min (1674 words)

AKA Something fishy in a sea of red herrings #

Guest Post #

Info

This is a guest post by Stine Elise Larsen, Senior Datacenter Consultant for Proact.

You can find her on Twitter and LinkedIn. In a work environment she is normally overly cautious, perhaps even to the point where she tries to make herself virtually invisible. In the real world she plays arcade dance games and attends rock concerts.

Last week, I worked with a customer on what seemed to be a straightforward VMware vCenter 7 certificate replacement job, but I encountered several red herrings that also turned out to be issues that needed solving. I thought I’d share them in this post, in the hope that they can help others in the future.

The initial issue was that the customer’s certificates had expired during the summer holidays, and they were presented with “Error 503, service unavailable” messages when trying to log in to vSphere Client. While renewing the certificates with certificate-manager in the vCenter Bash shell via SSH, the services got stuck at 85% and then failed to start after several minutes.

The first thing I checked was that the time was set correctly, since an incorrect time can cause several issues. Because we were able to log in to the vCenter Server Appliance Management Interface (VAMI), I could check that the NTP source was set up and that there were no issues. This can also be done in the vCenter Bash shell via SSH, using the command ntp.get to check the configured NTP source and the command date to check the current time. I also checked whether the STS certificate had expired.
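If you want to script an expiry check like this, here is a minimal sketch. It is not the method used in this case. It assumes you have already extracted the certificate's "Not After" timestamp in OpenSSL's text format, and the function name days_until_expiry is purely illustrative.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days until the certificate expires (negative if expired).

    not_after: OpenSSL-style timestamp, e.g. "Aug  3 12:00:00 2022 GMT".
    now: seconds since the epoch; defaults to the current system time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    # Floor division, so a certificate that expired an hour ago yields -1.
    return int((expires - now) // 86400)


if __name__ == "__main__":
    # Hypothetical timestamp for illustration only.
    print(days_until_expiry("Aug  3 12:00:00 2022 GMT"))
```

Note that the result is only as trustworthy as the system clock, which is why checking the time and NTP source comes first.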

I then checked whether there was enough disk space with df -h. After confirming that the services were still getting stuck at 85% after a reboot of the vCenter, I checked the certificate manager log, which you can find here: /var/log/vmware/vmcad/certificate-manager.log. In the log, I could see that there were inconsistencies in how the Fully Qualified Domain Name (FQDN) of the vCenter was written, and there were errors pointing to problems with the Subject Alternative Name (SAN).

Since I was unsure what was written in the old certificate, I had a look in the BACKUP_STORE using /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store BACKUP_STORE --text | less, and compared it to the input that was used when renewing the certificate.

However, after correcting the new certificate, we were still stuck on 85%.

A colleague of mine had experienced a similar issue in his home lab and shared his notes with me, which pointed to the Primary Network Identifier (PNID) not being present in the certificate. I checked this by comparing the output of /usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost with the certificate and the local hostname of the vCenter. None of these were written in the same case (some being lowercase and others in caps). I also used the vSphere Diagnostics Fling to make sure. I then used the PNID as the master and changed the hostname with the VAMI network configuration tool.
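A comparison like this is easy to get wrong by eye, so here is a minimal sketch of the idea: treat the PNID as the reference and flag every other source whose name differs from it only by letter case. The function name case_mismatches and the sample values are hypothetical. You would paste in the values you collected yourself.

```python
def case_mismatches(names):
    """Return the sources whose name differs from the reference only by case.

    names: mapping of source label -> name as reported by that source.
    The first entry is treated as the reference (here, the PNID).
    """
    items = list(names.items())
    _, ref_name = items[0]
    mismatched = []
    for label, name in items[1:]:
        same_ignoring_case = name.lower() == ref_name.lower()
        if same_ignoring_case and name != ref_name:
            mismatched.append(label)
    return mismatched


if __name__ == "__main__":
    # Hypothetical values; substitute the output of vmafd-cli get-pnid,
    # the local hostname, and the SAN entry from the certificate.
    observed = {
        "pnid": "vc01.local",
        "hostname": "VC01.local",
        "certificate SAN": "vc01.LOCAL",
    }
    print(case_mismatches(observed))
```

Names that differ by more than case are deliberately not reported here, since that would be a different kind of problem (a wrong FQDN rather than inconsistent capitalisation).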

After all of this, I was confident that I had found the error, but it was still stuck on 85% after resetting the certificates. I then tried resetting the Security Token Service (STS) certificate and replacing all the certificates through certificate-manager, but no luck.

After a dive through all the logs imaginable, and man does vCenter have a lot of log files, I found an error in the vpxd.log which said, “Unable to get certificates from the store APPLMGMT_PASSWORD” alongside “Failed to read X509 cert; err: 151441516”.

My colleague had also experienced this before and suggested that I run the following commands: grep -Hinr "password error" /var/log/vmware/vmdird/.log* and grep -Hinr "Bind Request" /var/log/vmware/vmdird/.log*. As he expected, there were several “Bind request failed” errors and authentication errors pointing to password errors. These pointed to the Administrator user first, so I changed the password using the vdcadmintool and tried restarting the services, with no success. I then tried changing the STS certificate again and a new error appeared, which said: “Error 9234: Authentication to VMware Directory service failed”. Now the grep commands were showing the password error for the machine account, so I changed that as well. Unfortunately, changing the certificates yet again failed when restarting the services.
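If you have copied the log files off the appliance and want to do the same search elsewhere, the following is a rough Python sketch of what grep -in does: a case-insensitive search that reports line numbers. The helper name grep_lines is made up for this example, it matches the search text literally rather than as a regular expression, and the sample lines are invented, not real vmdird output.

```python
import re


def grep_lines(lines, pattern):
    """Return (line_number, line) pairs containing pattern, ignoring case."""
    regex = re.compile(re.escape(pattern), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


if __name__ == "__main__":
    # Invented sample lines for illustration; not real vmdird log output.
    sample = [
        "service started\n",
        "Bind Request Failed for some account\n",
        "Password error for some account\n",
    ]
    for needle in ("password error", "Bind Request"):
        for number, line in grep_lines(sample, needle):
            print(f"{number}:{line}")
```

To search a real file, you could pass an open file object as the lines argument, since iterating over a file yields its lines.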

Following that, we also tried changing the certificates from self-signed to custom, but the services still wouldn’t start. At this point, we contacted VMware support. The advisor from VMware support could tell by the order in which the services were trying to start up that something was wrong with the certificates, and that it pointed to an SSL trust mismatch. This was finally fixed by running the lsdoctor tool with the --trustfix parameter to correct the SSL trust values, and then restarting the services:

Running lsdoctor -l to identify issues #

root@vc01 [ /tmp/lsdoctor-master ]# python lsdoctor.py -l

    ATTENTION:  You are running a reporting function.  This doesn't make any changes to your environment.
    You can find the report and logs here: /var/log/vmware/lsdoctor

2022-08-03T12:04:11 INFO main: You are reporting on problems found across the SSO domain in the lookup service.  This doesn't make changes.
2022-08-03T12:04:11 INFO live_checkCerts: Checking services for trust mismatches...
2022-08-03T12:04:11 INFO generateReport: Listing lookup service problems found in SSO domain
2022-08-03T12:04:11 ERROR generateReport: default-site\vc01.local (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
2022-08-03T12:04:11 INFO generateReport: Report generated:  /var/log/vmware/lsdoctor/vc01.local-2022-08-03-120411.json
root@NO0137VMVC [ /tmp/lsdoctor-master ]#

Running lsdoctor -t or --trustfix to fix the trust issues #

root@vc01 [ /tmp/lsdoctor-master ]# python lsdoctor.py -t

    WARNING:  This script makes permanent changes.  Before running, please take *OFFLINE* snapshots
    of all VC's and PSC's at the SAME TIME.  Failure to do so can result in PSC or VC inconsistencies.
    Logs can be found here: /var/log/vmware/lsdoctor

2022-08-03T12:04:54 INFO main: You are checking for and fixing SSL trust mismatches in the local SSO site.  NOTE:  Please run this script one PSC or VC per SSO site.

Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


Provide password for administrator@vm-oss.local:
2022-08-03T12:05:15 INFO __init__: Retrieved services from SSO site: default-site
2022-08-03T12:05:15 INFO findAndFix: Checking services for trust mismatches...
2022-08-03T12:05:15 INFO findAndFix: Attempting to reregister 4de1c858-08a7-43ec-903b-ca8198b08cb4_kv for vc01.local
2022-08-03T12:05:15 INFO findAndFix: Attempting to reregister 0352ce58-1812-47fa-ab3c-db913e8ad484 for vc01.local
2022-08-03T12:05:16 INFO findAndFix: Attempting to reregister f2314a00-e755-46d0-a689-4b7f389aedce for vc01.local
2022-08-03T12:05:16 INFO findAndFix: Attempting to reregister d935f890-87ae-459a-b3e7-3c2fb3ad1ceb for vc01.local
2022-08-03T12:05:16 INFO findAndFix: Attempting to reregister default-site:c8d08fa1-c4c0-4a45-bd70-c725d462ceb9 for vc01.local
2022-08-03T12:05:16 INFO findAndFix: Attempting to reregister 6fb48414-9db1-4c84-80d3-fadc5dbdd4aa for vc01.local
2022-08-03T12:05:16 INFO findAndFix: Attempting to reregister default-site:44b21511-a65b-4fbb-90ca-e01f3a35b16b for vc01.local
2022-08-03T12:05:17 INFO findAndFix: Attempting to reregister c59863c5-c736-4c77-b2da-523e0d3444df for vc01.local
2022-08-03T12:05:17 INFO findAndFix: Attempting to reregister 81f488c3-d975-417d-8341-012e135c1de8 for vc01.local
2022-08-03T12:05:17 INFO findAndFix: Attempting to reregister 71bccc22-085e-4bc9-92ce-c8dafa75d03f for vc01.local
2022-08-03T12:05:17 INFO findAndFix: Attempting to reregister aa296055-6610-4372-96af-b02e2034b320 for vc01.local
2022-08-03T12:05:17 INFO findAndFix: Attempting to reregister a0e44029-8f89-4ba2-aacd-81e690a218d0 for vc01.local
2022-08-03T12:05:18 INFO findAndFix: Attempting to reregister f4174796-4081-4d52-abc0-fc72384e3c08 for vc01.local
2022-08-03T12:05:18 INFO findAndFix: Attempting to reregister 12c24187-9d14-414d-bf73-e5b95e35ee80 for vc01.local
2022-08-03T12:05:18 INFO findAndFix: Attempting to reregister fa2bf3a9-efb7-41c2-88d2-5da0b8537e2e for vc01.local
2022-08-03T12:05:18 INFO findAndFix: Attempting to reregister 113e9a79-53a8-4bac-a098-00f86e6052cd for vc01.local
2022-08-03T12:05:18 INFO findAndFix: Attempting to reregister 1730d264-c24f-4a00-8954-cec9c622a126 for vc01.local
2022-08-03T12:05:19 INFO findAndFix: Attempting to reregister 68b48005-6902-406c-ab83-590aec3436d4 for vc01.local
2022-08-03T12:05:19 INFO findAndFix: Attempting to reregister 4de1c858-08a7-43ec-903b-ca8198b08cb4_authz for vc01.local
2022-08-03T12:05:19 INFO findAndFix: Attempting to reregister 74c3b715-23c4-4a84-bd7a-b4779b0947bc for vc01.local
2022-08-03T12:05:19 INFO findAndFix: Attempting to reregister 53bd1727-8664-4b06-bc5b-25cf77d37e59 for vc01.local
2022-08-03T12:05:20 INFO findAndFix: Attempting to reregister 06c4cd74-6aeb-4380-8c2f-04b8fba64eb2 for vc01.local
2022-08-03T12:05:20 INFO findAndFix: Attempting to reregister 1878c18b-fb06-4e43-b46e-9682c64b40ed for vc01.local
2022-08-03T12:05:20 INFO findAndFix: Attempting to reregister 605f8faa-f7c8-447a-b27b-b8f6e839e41e for vc01.local
2022-08-03T12:05:21 INFO findAndFix: Attempting to reregister 68a74ee0-2a99-43bb-b663-0e649325e8bb for vc01.local
2022-08-03T12:05:21 INFO findAndFix: Attempting to reregister 35d8d394-fc08-4acc-844f-392adbdb86bb for vc01.local
2022-08-03T12:05:21 INFO findAndFix: Attempting to reregister a73f8eba-a5a0-447a-8b81-c557692f7727 for vc01.local
2022-08-03T12:05:22 INFO findAndFix: Attempting to reregister db13ccae-b4fa-4ce1-a2ae-f68e081373ff for vc01.local
2022-08-03T12:05:22 INFO findAndFix: Attempting to reregister a096685e-ef11-4301-9d61-deeea4cced69 for vc01.local
2022-08-03T12:05:22 INFO findAndFix: Attempting to reregister b46c40af-33b8-4e8e-91a2-034b7286e679 for vc01.local
2022-08-03T12:05:22 INFO findAndFix: Attempting to reregister 2dd172a6-64e4-43db-9ee5-1aae64171d91 for vc01.local
2022-08-03T12:05:22 INFO findAndFix: Attempting to reregister 4de1c858-08a7-43ec-903b-ca8198b08cb4 for vc01.local
2022-08-03T12:05:23 INFO findAndFix: Attempting to reregister 77f9c3f9-582f-4570-96ff-6d2efd3f87e5 for vc01.local
2022-08-03T12:05:23 INFO findAndFix: Attempting to reregister a62ce6f2-733e-4ee4-8157-1340f9dec30f for vc01.local
2022-08-03T12:05:23 INFO findAndFix: Attempting to reregister 5d845f2e-8c91-48d4-b5d9-04eb81d3c569 for vc01.local
2022-08-03T12:05:24 INFO findAndFix: Attempting to reregister 739c41fd-f740-4742-8a27-aca98e744296 for vc01.local
2022-08-03T12:05:24 INFO findAndFix: Attempting to reregister default-site:a7622a00-b4a2-41b4-9242-35e58beb7bde for vc01.local
2022-08-03T12:05:24 INFO findAndFix: Attempting to reregister 0a8283c3-706c-4ac7-985d-6bf43dc8b8d7 for vc01.local
2022-08-03T12:05:24 INFO findAndFix: Attempting to reregister 6e77038c-c6f5-4a06-b6a4-e9685fab7ade for vc01.local
2022-08-03T12:05:24 INFO findAndFix: Attempting to reregister ec5e3e40-5226-44c0-bda6-b76a0eb272de for vc01.local
2022-08-03T12:05:25 INFO findAndFix: Attempting to reregister ab5fc7c7-407c-4134-8760-42124d7c3cb3 for vc01.local
2022-08-03T12:05:25 INFO findAndFix: Attempting to reregister 4f2e79de-9b47-4158-bd5e-a53bce45002c for vc01.local
2022-08-03T12:05:25 INFO findAndFix: Attempting to reregister c04bd00d-3a3f-45b3-a307-49e26c380ba0 for vc01.local
2022-08-03T12:05:25 INFO findAndFix: Attempting to reregister ec4c6fbf-bf3f-4594-9064-d9387ea803c3 for vc01.local
2022-08-03T12:05:25 INFO findAndFix: We found 44 mismatch(s) and fixed them :)
2022-08-03T12:05:25 INFO main: Please restart services on all PSC's and VC's when you're done.

Once this was done, the services started up again successfully and the vCenter was operational again.
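With this many lines of output, it can be reassuring to confirm that the number of reregistration attempts matches the summary line. Here is a minimal sketch that does so for output saved to a file or string. The regular expressions are based only on the lines shown above, so treat them as an assumption about the format rather than a documented lsdoctor interface. The name summarize_trustfix is hypothetical.

```python
import re

# Patterns derived from the output shown in this post, not from lsdoctor docs.
REREGISTER = re.compile(r"findAndFix: Attempting to reregister (\S+) for (\S+)")
SUMMARY = re.compile(r"findAndFix: We found (\d+) mismatch")


def summarize_trustfix(log_text):
    """Return (attempted_ids, reported_count) from saved --trustfix output.

    reported_count is None when no summary line is present.
    """
    attempted = [match.group(1) for match in REREGISTER.finditer(log_text)]
    summary = SUMMARY.search(log_text)
    reported = int(summary.group(1)) if summary else None
    return attempted, reported
```

In the run above, there are 44 "Attempting to reregister" lines, which agrees with the 44 mismatches reported at the end.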

Checking service status #

root@vc01 [ /tmp/lsdoctor-master ]# watch service-control --status --all
Every 2.0s: service-control --status --all                                                                                                       vc01.local: Wed Aug  3 12:16:19 2022

Running:
 applmgmt lookupsvc lwsmd observability observability-vapi pschealth vlcm vmafdd vmcad vmdird vmonapi vmware-analytics vmware-certificateauthority vmware-certificatemanagement vmware-cis-license vmware-c
ontent-library vmware-eam vmware-envoy vmware-hvc vmware-infraprofile vmware-perfcharts vmware-pod vmware-postgres-archiver vmware-rhttpproxy vmware-sca vmware-sps vmware-statsmonitor vmware-stsd vmware-
topologysvc vmware-trustmanagement vmware-updatemgr vmware-vapi-endpoint vmware-vdtc vmware-vmon vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vmware-vsan-health vmware-vsm vsphere-ui vstats vtsdb wcp
Stopped:
 vmcam vmware-imagebuilder vmware-netdumper vmware-rbd-watchdog vmware-vcha
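If you would rather check for a specific service than scan the list by eye, the status output can be split into per-state lists. This is a minimal sketch that assumes the layout shown above (a state name followed by a colon, then service names separated by whitespace). The function name parse_service_status is illustrative only.

```python
def parse_service_status(output):
    """Split service-control --status --all output into per-state lists."""
    states = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A state header is a single word ending in a colon, e.g. "Running:".
        if stripped.endswith(":") and " " not in stripped:
            current = stripped[:-1]
            states[current] = []
        elif current is not None:
            states[current].extend(stripped.split())
        # Anything before the first header (such as a watch banner) is skipped.
    return states


if __name__ == "__main__":
    # Shortened excerpt of the output above.
    excerpt = "Running:\n applmgmt lookupsvc lwsmd\nStopped:\n vmcam vmware-vcha\n"
    status = parse_service_status(excerpt)
    print("vmware-vcha" in status["Stopped"])
```

One caveat: output captured from a narrow terminal, as above, can wrap in the middle of a service name, so it is safer to feed this the command's output directly than a copy from the screen.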

Conclusion #

In conclusion, there was a lot to learn from this issue. Firstly, what might seem like red herrings may very well be underlying problems that also need to be solved. In the end, the final issue was fixed by VMware Support, but without all the troubleshooting steps performed before they were brought on board, the fix might not have been so straightforward.

Secondly, certificate issues like this are notoriously hard to troubleshoot, especially given the interdependencies between them and the internal vCenter services.

Thirdly, there is no shame in asking for help.

As Nick Craver, Principal Software Engineer @ Microsoft put it:

I would also like to highlight these tools if you have VMware vCenter certificate issues:




Post last updated on August 8, 2022: Update Expired-VMware-vCenter-7-certificates.md

About the author


Christian Mohn works as a Chief Technologist SDDC for Proact in Norway.

See his About page for more details, or find him on Twitter.
