Guest Post: The Curious Case of the Intel Microcode Part #2 – It Gets Better — Then Worse

This guest post by Bjørn Anders Jørgensen, Senior Systems Consultant Basefarm, first appeared on LinkedIn.

Before you start on this rather long post, have a go at part #1:

 tl;dr

This is a long read. To get to the juicy part on how Intel potentially shipped pre-release microcode to partners, skip to section 3. The short, short version is that the official Intel microcode update contains newer microcode for Skylake-SP and Kaby Lake/Coffe Lake than what currently is shipping from VMware/HPE/DellEMC etc.

Section 1: The good

Last week I wrote about how Intel should improve their microcode update delivery mechanism and offer full disclosure on their microcode changes. Then this week events progressed in rapid succession:

  • Intel released updated microcode bundle 20180108
  • VMware released updated patches for vSphere, including microcode updates
  • Then it was discovered that the updates cause stability issues
  • Most computer vendors recalled BIOS updates for Haswell/Broadwell
  • VMware recommended not to expose VMs to the new CPU feature flag
  • I discovered that Xeon SP and Kaby Lake/Coffe Lake updates from most or all vendors is based on the pre-release Intel bundle 20171215!

First of all I have to say I feel for all the engineers and product managers taken by surprise when the news broke early on the Meltdown and Spectre vulnerabilities. Apparently the embargo was suppose to be lifted on the 9. January, and everyone was working towards this date. There must have been extreme pressure from peers and customers. Mistakes will be made, but could have been avoided if Intel was more transparent on their process and releases.

Lets start with the Intel microcode bundle. Deconstructing the bundle using MC extractor you’ll find 94 binary files containing 196 individual microcode updates. Intel has released a single bundle with all updates going back to 1998. This is good! As far as I know there has not been a single update bundle going back this far. Additionally all vSphere patches contains microcode updates for most production systems running vSphere, including an update for AMD EPYC processors as described in the security advisory. You can find all the details on my github repository.

This is progress – however there are still no release notes or details except for the updated microcode versions for each processor family:

- Updates upon 20171117 release --
IVT C0          (06-3e-04:ed) 428->42a
SKL-U/Y D0      (06-4e-03:c0) ba->c2
BDW-U/Y E/F     (06-3d-04:c0) 25->28
HSW-ULT Cx/Dx   (06-45-01:72) 20->21
Crystalwell Cx  (06-46-01:32) 17->18
BDW-H E/G       (06-47-01:22) 17->1b
HSX-EX E0       (06-3f-04:80) 0f->10
SKL-H/S R0      (06-5e-03:36) ba->c2
HSW Cx/Dx       (06-3c-03:32) 22->23
HSX C0          (06-3f-02:6f) 3a->3b
BDX-DE V0/V1    (06-56-02:10) 0f->14
BDX-DE V2       (06-56-03:10) 700000d->7000011
KBL-U/Y H0      (06-8e-09:c0) 62->80
KBL Y0 / CFL D0 (06-8e-0a:c0) 70->80
KBL-H/S B0      (06-9e-09:2a) 5e->80
CFL U0          (06-9e-0a:22) 70->80
CFL B0          (06-9e-0b:02) 72->80
SKX H0          (06-55-04:b7) 2000035->200003c
GLK B0          (06-7a-01:01) 1e->22

Note that 20171117 is Intels previous official release

Keep a mental note for version 80 for Kaby Lake/Coffee Lake and 200003c for Skylake-SP (Xeon SP). Also note the absence of Sandy Bridge updates.

Section 2: The bad

With new microcode and OS patches available, tens if not hundreds of thousands of IT professionals spend much of last week planning and executing mitigation for the Meltdown and Spectre vulnerabilities. And with Spectre variant #2 (Branch Target Injection) requiring microcode update, I will assume most also started deploying updated BIOS and Intel patches through the OS mechanism for Linux and vSphere.

With the wider deployment it did not take long for customers to discover there where issues with system crashes or “higher system reboots” as Intel marking tries to spin it. How are system administrators suppose to make a informed decision based on that? It is quite ironic that Intel CEO, Brian Krzanich at the same time makes his “Security-First Pledge” offering “Transparent and Timely Communications”. It seems like Lenovo is the only vendor with some actual details on the issues:

  1. (Kaby Lake U/Y, U23e, H/S/X) Symptom: Intermittent system hang during system sleep (S3) cycling. If you have already applied the firmware update and experience hangs during sleep/wake, please flash back to the previous BIOS/UEFI level, or disable sleep (S3) mode on your system; and then apply the improved update when it becomes available. If you have not already applied the update, please wait until the improved firmware level is available.
  2. (Broadwell E) Symptom: Intermittent blue screen during system restart. If you have already applied the update, Intel suggests continuing to use the firmware level until an improved one is available. If you have not applied the update, please wait until the improved firmware level is available.
  3. (Broadwell E, H, U/Y; Haswell standard, Core Extreme, ULT) Symptom: Intel has received reports of unexpected page faults, which they are currently investigating. Out of an abundance of caution, Intel requested Lenovo to stop distributing this firmware.

With all the system manufacturers removing their BIOS downloads for Haswell and Broadwell and VMware recommending to “hide the speculative-execution control mechanism for virtual machines” if patches have already applied, this cannot be described as nothing short of an industry wide recall. It is hard to say how bad it is, but if it is so bad that Intel is recalling the update and CEO and EVP posting publicly I think it is wise to await further updates and see how things pan out. Intel will release updated microcode soon and the various vendors will post new BIOS updates in the coming few weeks. Go to the VMware link and note the various processors affected in yellow, because it gets worse:

Section 3: The ugly

Disclaimer: The following analysis is made with best effort on little information. Intel is shipping old and new versions in their microcode update bundle. It may be that they are for different CPU steppings of the same processor.

While all this is unfolding I am trying to piece together what the Intel and VMware patches actually contains. Using MC extractor I was able to inspect the microcode bundles, map the updates to CPU models and cross reference different releases. The complete list is on github. In the rush to get patches out, it seems like Intel made some last minute changes that was not communicated to partners. I have been in contact with several vendors and none seems to be aware that there changes made between the unofficial pre-release bundle from 15. December that was shared under embargo. Luckily this only affect three processor generations, but combined with the Haswell/Broadwell issues, it leaves us with an awful mess and a potential open vulnerability.

To quote my one week younger self:

So with six months lead time Intel has not been able to release an official microcode bundle update. The latest official one is from 17. November and does not seem to contain any Spectre fixes. Worse, it seems like the unofficial bundle is only partially fixing variant #2/Spectre.

It is obvious that Intel made some last minute changes to microcode and failed to notify partners. What changes has been made between update bundle 20171215 and 20180108? Are we installing incomplete non-released microcode to millions of computers? Only Intel knows! It might be as simple as an effort to include some minor fixes now what everyone with an Intel processor have to update their computer, or it can be a monumental error where many computers are still partially vulnerable to Spectre even with patches installed. With the “security first” pledge from CEO Brian Krzanich promising transparency I expect and demand an explanation from Intel. How is it that VMware, HPE and DellEMC can ship pre-release microcode to millions of computers?

Intel microcode update bundle pre-release 15. December
+-------+--------------------------+---------+------------+---------+---------+----------+----------+--------+
| CPUID |         Platform         | Version |    Date    | Release |   Size  | Checksum |  Offset  | Latest |
| 50654 |  B7 [0, 1, 2, 4, 5, 7]   | 200003A | 2017-11-21 |   PRD   |  0x6C00 | C088D252 | 0x64400  |  Yes   |
| 806E9 |        C0 [6, 7]         |    7C   | 2017-12-03 |   PRD   | 0x18000 | 5C75A5FE | 0x8B000  |   No   |
| 806EA |        C0 [6, 7]         |    7C   | 2017-12-03 |   PRD   | 0x17C00 | B81BC926 | 0xA3000  |  Yes   |
| 906E9 |       2A [1, 3, 5]       |    7C   | 2017-12-03 |   PRD   | 0x18000 | 6CF72404 | 0xBAC00  |   No   |
| 906EA |        22 [1, 5]         |    7C   | 2017-12-03 |   PRD   | 0x17800 | 55695D1F | 0xD2C00  |  Yes   |
| 906EB |          02 [1]          |    7C   | 2017-12-03 |   PRD   | 0x18000 | 5046D998 | 0xEA400  |  Yes   |
+-------+--------------------------+---------+------------+---------+---------+----------+----------+--------+

Intel microcode update bundle released 8. january
+-------+--------------------------+---------+------------+---------+---------+----------+----------+--------+
| CPUID |         Platform         | Version |    Date    | Release |   Size  | Checksum |  Offset  | Latest |
+-------+--------------------------+---------+------------+---------+---------+----------+----------+--------+
| 50654 | B7 [0, 1, 2, 4, 5, 7]    | 200003C | 2017-12-08 |   PRD   | 0x6C00  | A4059069 |  0x0     |  Yes   |
| 806E9 | C0 [6, 7]                |    80   | 2018-01-04 |   PRD   | 0x18000 | 6961A256 |  0x0     |  Yes   |
| 806EA | C0 [6, 7]                |    80   | 2018-01-04 |   PRD   | 0x18000 | F6263DAE |  0x0     |  Yes   |
| 906E9 | 2A [1, 3, 5]             |    80   | 2018-01-04 |   PRD   | 0x18000 | 6AA1DE93 |  0x0     |  Yes   |
| 906EA | 22 [1, 5]                |    80   | 2018-01-04 |   PRD   | 0x17C00 | 84CABC68 |  0x0     |  Yes   |
| 906EB |  02 [1]                  |    80   | 2018-01-04 |   PRD   | 0x18000 | D24EDB7F |  0x0     |  Yes   |
+-------+--------------------------+---------+------------+---------+---------+----------+----------+--------+

Note that all updates are labeled PRD for production. If we take the VMware matrix and mark the recalled microcode in yellow and the potential pre-release in red, we get a pretty depressing chart:

The same versions was also shipped in Dell BIOS updates. Thanks to Dell engineering for actually listing the microcode updates, I have not seen the same transparency from other vendors. HPE take note!

R740/R740xd/R640/R940/7920R

Updated the Intel Xeon Processor Scalable Family Processor Microcode to version 0x3A.

R830                            

Updated the Xeon Processor E5-2600 v3 Product Family Processor Microcode to version 0x3B
Updated the Intel Xeon Processor E5-2600 v4 Product Family Processor Microcode to version 0x0b000025

R630/R730/R730XD

Updated the Xeon Processor E5-2600 v3 Product Family Processor Microcode to version 0x3B
Updated the Intel Xeon Processor E5-2600 v4 Product Family Processor Microcode to version 0x0b000025

R930

Updated the Intel Xeon processor E7-4800/8800 v3 Product Family Processor Microcode to version 0x10
Updated the Intel Xeon processor E7-4800/8800 v4 Product Family Processor Microcode to version 0x0b000025

The Rx30 updates have now been removed from the Dell web page due to the aforementioned issues, but the R740 update is still there. Should it be recalled as well? Hard to tell, I think an update from Intel is in order.

I have not been able to attempt to load the new microcode on a server to confirm that it actually would apply, but I tested on my Precision 5520 laptop that use a Xeon E3-1505M v6 processor with the help of the VMware microcode update driver for Windows and the microcode will update from 7C to 80:

Some advice

So what is a system administrator to do? My advice is to wait it out. The situation is still fluid and changes almost on a daily basis. Let the dust settle and hope for some more clarity in the coming week. Remember that microcode updates are only for Spectre variant #2 vulnerability which is the hardest to exploit. If you already started patching vSphere, consider a rollback or implement the workaround from VMware.

Focus on patching Meltdown on soft targets such as laptops. These devices are much more likely to be running untrusted code. Dont forget applications, especially browsers. Researchers has been able to exploit meltdown and read non-cached data. The hackers can not be far behind. Combining Meltdown with a remote exploit could be disastrous.

As everything except Rasberry Pi is vulnerable to Spectre, focus on mobiles and tablets. Map you vulnerable devices and vendor updates. Pace yourself, we will be fighting this for years to come. There will be more side-channel vulnerabilities in the future and this is not the last round of patching.

Keep in mind that there are no microcode updates for Sandy Bridge processors yet. This is why Dell and HPE is holding back updates for Poweredge G12 servers and Proliant Gen8 servers. If you have a lot of Xeon 4600/2600/1600/1400/1200 V1 servers, you will have to wait on Intel and server vendors if you want full mitigation for Spectre.

VMware vSAN 6.6 – What’s New for Day 2 Operations?

VMware has just announced vSAN v6.6, with over 20 new features. While new and shiny features are nice I’d like to highlight a couple that I think might be undervalued from release feature-set perspective, but highly valuable in day to day operations of a vSAN environment, otherwise known as Day 2 operations.

vSAN Configuration Assist is one such new feature. While it’s true that it helps first time configuration of a greenfield installation with vSAN (no more bootstrapping, yay!), it also helps with Day 2 operations.

It helps configure new hosts added to an existing vSAN enabled cluster, but it also makes it possible to automate updating of IO controllers, both firmware and drivers directly from within vCenter. As everyone should know by now, vSAN is highly dependent on drivers and firmware being on supported levels. This improvement helps the improved vSAN Health Check (Enhanced Health Monitoring) alert you when new and verified drivers/firmware are available, and if the controller tools are available on the ESXi host, it can also update the firmware for you.

Directly from vCenter, utilising maintenance mode like you’re used to from patching your ESXi hosts from VMware Update Manager (even if it’s not integrated into VUM at this point).

This new features takes the vSAN HCL list to a new level, it’s no longer just a list over supported IO controllers and their firmware and drivers, it’s a now also a software distribution point. At the moment Dell EMC, Fujitsu, Lenovo and SuperMicro are all supported vendors for this new distribution model, hopefully the rest will follow suit quickly.

The second feature I would like to highlight as a Day 2 operations enhancement, is the new vSAN Cloud Analytics feature. If you participate, in the Customer Experience Improvement Program (CEIP), it will enable custom alerting for your environment, based on your own environment. For instance, if a new Knowledge Base article is published, that pertain to your specific setup, it will alert you about it. One example might be if you have Intel X710 NICs, which can cause PSODs — Wouldn’t it be nice if you got alerted that this might be an issue, and then told how to remediate it? Well, with vSAN 6.6 you’ll get exactly that.

With vSAN 6.6 you get both automated, and verified, firmware/driver upgrades, as well as proactive alerting for potential issues through the hive-mind that is the analytics service. This is what VMware calls Intelligent Operations and Lifecycle Management in this release, and it’s really hard to argue with that.

Of course, vSAN 6.6 provide other Day 2 Operations enhancements as well, like Degraded Device Handling (DDH), Simplified Stretched Cluster Witness replacement procedures, Capacity and Policy Pre-Checks and access to the vSAN control plane trough the ESXi Host Client, but I’ll leave those for later posts.

Cross vCenter VM Mobility Fling – macOS?

The VMware Cross vCenter VM Mobility – CLI was recently updated so I decided to try it out. In short, this little Java based application allows you to easily move or clone VMs between disparate vCenter environments.

The Fling is listed with the following requirements:

  • JDK 1.7 or above
  • Two vCenter instances with ESX 6.0
  • Windows : Windows Server 2003 or above
  • Linux : RHEL 7.x or above, Ubuntu 11.04 or above

There is no mention of macOS there, but I decided to give it a go any way, and it turns out that it works just fine on macOS as well!
Just make sure you have the Java JDK installed locally. When I ran it the first time, I got the following error, since the JAVA_HOME environment variable was not set.

[cc lang=”bash”]
~/Downloads/xvc-mobility-cli_1.2$ sh xvc-mobility.sh
set JAVA_HOME to continue the operation
[/cc]

This is very easy to fix, just run the following command in your terminal of choice, and xvc-mobility.sh should work just fine on your Mac.

[cc lang=”bash”]
export JAVA_HOME=$(/usr/libexec/java_home)
[/cc]

Next up is running the fling with the correct parameters (this is a clone operation, not a relocate):

[cc lang=”bash”]
~/Downloads/xvc-mobility-cli_1.2$ sh xvc-mobility.sh -svc [source-vcenter] -su [source-vcenter-username]
-dvc [destination-vcenter] -du [destination-vcenter-username]
-vms [vm-name] -dh [destination-host]
-dds [destination-datastore] -op clone -cln [destination-vm-name]

13:41:40.591 [main] INFO com.vmware.sdkclient.vim.Task – CloneVM_Task | State = SUCCESS | Error = null | Result = [email protected]
13:41:40.597 [main] INFO com.vmware.sdkclient.vim.Task – Monitor task end
13:41:40.597 [main] INFO com.vmware.sdkclient.vim.Task – CloneVM_Task took : 0:51:33.728
13:41:40.603 [main] INFO c.v.s.helpers.CrossVcProvHelper – Successfully cloned the vm:[destination-vm-name]
[/cc]

I was able to clone a VM from my lab in Bergen to my lab in Oslo, without any problems what-so-ever. Not only is that a Cross vCenter vMotion, but also a Cross Country one, awesome!

Now this is just an example, please check the official documentation for all the parameters, and what the tool expects.