Important note about the security flaw impacting ARM & Intel hardware

Dear Cloud Riders,

Several days ago, we became aware of a security vulnerability impacting x86 and more recently ARM processors used by Scaleway and other cloud providers.

While trying to get a solution to this vulnerability as fast as possible, we faced communication issues with Intel, which was deliberately restricting & filtering information about the bug.

Due to the criticality of this bug, our Security team proactively took the decision to perform a major security update on all our hypervisors (ARM & X64).

Tomorrow, we will perform a security update of all impacted hypervisors and will need to reboot the servers running on top of them.
A maintenance window has been scheduled from 01/04/18 at 7am UTC to 01/06/18 at 7am UTC.

During this maintenance, servers running on top of impacted hypervisors will be unavailable for a few minutes during the reboot phase.
We will reboot clusters one at a time to limit downtime on your infrastructure.

We sincerely apologize for the short notice. We believe security and privacy are crucial on cloud platforms, and we decided today to trade some availability in favor of security.

Additional information

The Scaleway Security Team

This blog post may be updated in the coming hours.

Update #1 - 01/04/18 11am UTC

According to the latest update from Intel, a microcode update is required to completely fix the bug. The microcode release date is, at this time, scheduled for an undisclosed (confidential) and unacceptably late date. Due to the emergency, we decided to perform a first reboot of the platform right now to update the hypervisor kernels, even if we need to perform a second one when the microcode becomes available.

We will start by patching our Workload Intensive hypervisors in the coming hours.

According to the latest update from Cavium, ThunderX SoCs are NOT vulnerable. We're still waiting for a more thorough update.

Update #2 - 01/04/18 12pm UTC

We just released the 4.14.11 kernel so every Scaleway customer can move to a fixed kernel. To upgrade your server kernel, simply change your bootscript by selecting the 4.14.11 rev1 and then reboot your server. A soft-reboot from the OS is sufficient to apply the change.

If you have a large-scale deployment, this can of course be automated with our CLI. Check out the snippet to perform the operation.

Update #3 - 01/04/18 2pm UTC

We are currently sending emails to Online Dedibox customers to inform them about critical security vulnerabilities affecting many CPU architectures (CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754).
To our knowledge, no distribution officially ships a fixed kernel yet, but we encourage you to check regularly for security updates and to upgrade your kernel once one is available.

For more information regarding these vulnerabilities, check out the following links

Update #4 - 01/04/18 3pm UTC

We are starting to fix a small batch of Starter Cloud and Workload Intensive hypervisors and plan to increase the deployment of the patch in the coming hours depending on our initial results.

Update #5 - 01/04/18 4pm UTC

Online Web Hosting Platform fix deployment is in progress.
During this operation, expect a few minutes of downtime during the reboot phase of the core infrastructure servers.
We are carefully monitoring the performance impact.

Update #6 - 01/04/18 7pm UTC

According to the latest news from concordant sources:

  • We have confirmation that Meltdown is fixed by upgrading to a kernel that includes KPTI, starting with kernel version 4.14.11.
  • The combination of the kernel update and the microcode completely fixes the Meltdown & Spectre vulnerabilities. At this time, we do not have any microcode available for any of our Online Dedibox and Scaleway cloud servers.
  • Concerning the performance impact, we now know that both the microcode upgrade and the kernel upgrade will have a non-negligible performance impact, especially on I/O-intensive applications. We have no precise idea of the performance reduction at this time.
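Since the size of the slowdown is not yet known, a rough way to compare a server before and after the kernel upgrade is to time a cheap system call. The sketch below is only an illustration under stated assumptions: the function name `mean_syscall_seconds` and the iteration count are ours, the figure includes Python interpreter overhead, and it says nothing about real application workloads.

```python
import os
import time


def mean_syscall_seconds(iterations=200000):
    """Average wall-clock time of one os.getppid() call, in seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        # getppid() is a trivial call that enters the kernel each time.
        os.getppid()
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    # Run once on the old kernel and once on the new one, then compare.
    print("%.1f ns per call" % (mean_syscall_seconds() * 1e9))
```

Running it several times on an otherwise idle server, on both kernels, gives a first indication; benchmarks of the actual workload remain the only reliable measurement.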

We decided to delay the reboot of all hypervisors until tomorrow 11am UTC to avoid multiple infrastructure downtimes and to fix both the Meltdown and Spectre issues at the same time.
This operation may be rescheduled depending on the information we get in the meantime.

For customers using BareMetal servers, we will also need to apply the microcode on each server to secure it against Spectre.
Earlier today, we shipped a 4.14.11 kernel available via bootscript. We strongly encourage you to update your server's kernel and reboot as soon as possible.

Update #7 - 01/04/18 8:30pm UTC

Several days ago, we became aware of a security vulnerability impacting x86 and more recently ARM processors used by Scaleway and other cloud providers.

For 72 hours now, our security & SRE teams have been working to understand & eradicate the Meltdown and Spectre vulnerabilities impacting our servers and all CPUs worldwide.

Due to the incomplete information provided by hardware manufacturers, we joined forces with other potentially impacted cloud providers, including Linode, Packet and OVH, and created a dedicated communication channel to share information and work together to address the Meltdown & Spectre vulnerabilities.

A few minutes ago, we got confirmation from Supermicro that they will deliver a microcode upgrade for our Workload Intensive servers tomorrow evening. We should then be able to totally secure our Workload Intensive hypervisors.

At this stage, we are also working on these issues for all our other offers.

Update #8 - 01/04/18 10:55pm UTC

Our current understanding of the situation is that, on Intel CPUs:

  • Spectre 1 (bounds check bypass, CVE-2017-5753) is both hard to exploit and hard to patch, but has a limited impact.
  • Spectre 2 (branch target injection, CVE-2017-5715) will be fixed using a microcode update in the short term (the exact delay depends on Intel and the server manufacturers), with a performance impact. We currently have no confirmation of the exact kernel version needed to work in conjunction with the microcode update. Our current assumption is that the required fixes are not yet in the main kernel tree and that they will be merged in 4.15. In the long term, the vulnerability could be fixed by Retpoline to reduce the performance impact, but due to the large amount of work (everything needs to be recompiled), it will probably not be available for several weeks.
  • Meltdown (rogue data cache load, CVE-2017-5754) is completely mitigated by the KPTI patches merged in 4.14.11.

We will continue to deploy patches to solve the Meltdown and Spectre issues during the coming days. Our ability to resolve the Spectre 2 vulnerability directly depends on the release speed of both Intel and the manufacturers.

Next upgrade tomorrow at 9am UTC.

Update #9 - 01/05/18 9:30am UTC

Since yesterday evening, we have been actively tracking the Linux kernel tree and are currently waiting for the IBRS patches to be merged.

On the distribution side:

At the same time, we are investigating the QEMU & KVM sides to understand the complete mitigation process needed to totally secure both the guest and the host from all vulnerabilities.

We expect to receive the first microcode updates from our hardware providers to mitigate Spectre 2 in the coming hours.

Update #10 - 01/05/18 12:10pm UTC

Last night, Digital Ocean, Vultr, Nexcess and prgmr.com joined our response communication platform to centralize efforts.

From our latest information, it seems that variant 3 (Meltdown) cannot be exploited to cross VM boundaries on KVM due to the way memory is managed. A guest cannot read the memory of the hypervisor or of another guest VM, even with virtio.
At this stage, we believe variant 2 is exploitable on KVM; we are still investigating.

That means that all Scaleway cloud riders can already protect their servers from Meltdown by upgrading their servers' bootscript: https://www.scaleway.com/docs/bootscript-and-how-to-use-it/.

We just received the microcode update for R730 and R730XD servers from Dell.
If you have a running Dell R730 or R730XD (Dedibox server), you will be able to reboot your server in the coming hours to apply the microcode.

Important: note that the microcode doesn’t fix the vulnerabilities without the kernel update. At the moment, fixed kernels are not yet publicly distributed.

We will send an email to all our Dedibox customers once we have all the microcodes and the updated kernels.

At this time, we have not received any microcode or information from other hardware vendors, including HP, QCT, IBM and Cavium.

(Dedibox related) Update #11 - 01/05/18 2pm UTC

The microcode for Dell R730 and R730XD servers (Dedibox ENT SATA 2015, ENT SSD 2015, mWOPR SATA 2015, mWOPR SSD 2015, WOPR SATA 2015, WOPR SSD 2015, ST12 SSD 2016 and ST24 SSD 2016) has been deployed.

If you have one of the servers listed above, you can reboot to apply the microcode. It’s a permanent microcode fix from BIOS!

Important:

  1. The reboot can take up to 15 minutes due to the microcode update.
  2. Note that the microcode doesn’t fix the vulnerabilities without the kernel update. At the moment, kernel fixes are not yet publicly distributed.
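One way to confirm that the reboot actually loaded a new microcode is to compare the revision reported by the kernel before and after. The sketch below assumes an x86 Linux server, where `/proc/cpuinfo` carries a `microcode` field per logical CPU; the helper names `microcode_revisions` and `microcode_changed` are illustrative, not part of any Online or Scaleway tooling.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo text."""
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, re.MULTILINE))


def microcode_changed(before_text, after_text):
    """True when the revisions after the reboot differ from those before."""
    return microcode_revisions(before_text) != microcode_revisions(after_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Not a Linux host, or /proc is not mounted.
        text = ""
    print(sorted(microcode_revisions(text)))
```

Saving the output before the reboot and comparing it afterwards shows whether the revision moved; it does not by itself prove that the kernel-side mitigation is active.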

We will send an email to all our Dedibox customers once we have all the microcodes and the updated kernels.

We received the QCT microcode for X10E-9N (Dedibox LT and MD 2017). We expect to deploy it in a few minutes.

90% of our shared hosting platform is patched against Meltdown.

Update #12 - 01/05/18 3pm UTC

The microcode for QCT server X10E-9N (Dedibox LT and MD 2017) has been deployed.

If you have one of the servers listed above, you can reboot to apply the microcode (a soft-reboot from the OS is sufficient). It’s a permanent microcode fix via BIOS update!

Important:

  1. Unlike the live microcode update, the microcode fix via BIOS upgrade is permanent and is not distribution dependent.
  2. The reboot can take up to 15 minutes due to the microcode update.
  3. Note that the microcode doesn’t fix the vulnerabilities without the kernel update. At the moment, kernel fixes are not yet publicly distributed.

We will send an email to all our Dedibox customers once we have all the microcodes and the updated kernels.

We received the Dell microcode for DSS-2500 (Dedibox ENT 2016, WOPR 2016, mWOPR 2016 and ST48 2017). We expect to deploy it in a few minutes.

Scaleway ARMv8 servers

From our latest information, Cavium ThunderX should not be impacted by Meltdown, Spectre 1 and Spectre 2. It means that the complete range of Scaleway ARMv8 servers is not affected by any of the three vulnerabilities.

Update #13 - 01/05/18 4:45pm UTC

  • Dedibox & Scaleway - Starting now, we will maintain the status of all our server ranges via the table under the global overview section. Important note: BIOS updates for X10E-9N, DSS1510, DSS2500 and R730 have already been pushed and are available via a single soft reboot.

Important:

  1. Unlike the live microcode update, the microcode fix via BIOS upgrade is permanent and is not distribution dependent.
  2. The reboot can take up to 15 minutes due to the microcode update.
  3. Note that the microcode doesn’t fix the vulnerabilities without the kernel update. At the moment, kernel fixes are not yet publicly distributed.

  • Web Hosting - 99% of our shared hosting platform is patched against Meltdown.
  • Scaleway Customers Kernels - We are building the 4.14.12 kernel and the LTS 4.9.75 & 4.4.110 kernels. Kernel 4.14.11 has been available via bootscript since yesterday 3am UTC. The 4.14.11 kernel is battle-tested and fixes the Meltdown vulnerability.
  • Scaleway ARMv8 - Cavium confirms that ThunderX is not affected at all by Meltdown, Spectre 1 and Spectre 2.

Update #14 - 01/05/18 6:15pm UTC

Online Cloud Web Hosting

Tomorrow morning, we will upgrade the Online Cloud Web Hosting platform.
During this maintenance, expect a few minutes of downtime during the reboot phase of the servers. 
We are carefully monitoring the performance impact.

Scaleway X64 Workload Intensive servers

A few minutes ago, we received a release candidate of the Supermicro patched BIOS including the fixed microcode. We are currently testing and validating this.
We plan to upgrade X64 Workload Intensive hypervisors as soon as fixed kernels are publicly distributed.

Dedibox Classic 2016

The patched BIOS for the X10SDE server (Dedibox Classic 2016) will be deployed once we finish a short validation. We will update the table once it is in effect.

Update #15 - 01/05/18 11:30pm UTC

We released the 4.14.12, 4.9.75 and 4.4.110 kernels, available via bootscript for all x86-64 Scaleway servers. To secure your server against Meltdown, simply change your bootscript by selecting any of these kernels and then reboot your server. A soft-reboot from the OS is sufficient to apply the change.

If you have a large-scale deployment, this can of course be automated with our CLI. Check out the snippet to perform the operation.
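After the reboot, it can be useful to check across a fleet that each server really runs one of the fixed kernels. The following is a minimal sketch, not the CLI snippet mentioned above: the names `MIN_FIXED`, `parse_release` and `has_meltdown_fix` are ours, the per-branch minimums are the versions named in this post, and the comparison only makes sense for mainline-style version numbers such as the bootscript kernels. Distribution kernels backport fixes under their own numbering and need a different check.

```python
import platform
import re

# Minimum Meltdown-fixed release per stable branch, as named in this post.
MIN_FIXED = {(4, 4): 110, (4, 9): 75, (4, 14): 11}


def parse_release(release):
    """Return (major, minor, patch) from a string such as '4.14.12-rev1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_meltdown_fix(release):
    """True or False for known branches, None when the branch is not listed."""
    major, minor, patch = parse_release(release)
    if (major, minor) > (4, 14):
        # Assumption: branches newer than 4.14 include KPTI from the start.
        return True
    minimum = MIN_FIXED.get((major, minor))
    if minimum is None:
        return None
    return patch >= minimum


if __name__ == "__main__":
    running = platform.release()
    print(running, has_meltdown_fix(running))
```

A `None` result means the branch is not covered by the table and should be checked by hand rather than assumed safe.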

Next upgrade tomorrow morning (UTC).

Update #16 - 01/06/18 10am UTC

Online Cloud Web Hosting

We are currently upgrading the Online Cloud Web Hosting platform to secure it against Meltdown.
During the maintenance, expect a few minutes of downtime during the reboot of the servers. 
We are carefully monitoring the performance impact.

We will update the status when the maintenance is done.

Update #17 - 01/07/18 11am UTC

Dedibox

We received several BIOS updates, including the updated microcode, from Supermicro last night at 2:30am UTC.
Yesterday, our team validated and deployed the BIOS on the following Dedibox offers:

  • Dedibox Classic 2016
  • Dedibox LT 2016, Dedibox MD 2016

If you have one of the servers listed above, you can reboot to apply the microcode (a soft-reboot from the OS is sufficient). It’s a permanent microcode fix via BIOS update!

Important:

  1. Unlike the live microcode update, the microcode fix via BIOS upgrade is permanent and is not distribution dependent.
  2. The reboot can take up to 15 minutes due to the microcode update.
  3. Note that the microcode doesn’t fix the vulnerabilities without the kernel update. At the moment, kernel fixes are not yet publicly distributed.

Scaleway Workload Intensive servers

The microcode update for the Scaleway Workload Intensive servers is now completely validated. We're waiting for the kernel-level patches to update our fleet.

Update #18 - 01/08/18 12pm UTC

Two new cloud providers, AWS and Tata Communications, and core members of the Red Hat and Ubuntu teams joined the task force!

Scaleway Cloud Platform and Online Web Hosting

We are currently working on:

  • Improving the hypervisor upgrade process to reduce downtime
  • Spectre 2 mitigation using an IBRS-enabled kernel (still waiting for patches to be merged) plus a microcode upgrade
  • Retpoline testing

Update #19 - 01/09/18 5:00pm UTC

Today, Moritz Lipp, Daniel Gruss and Michael Schwarz from Graz University of Technology, one of the teams that discovered Meltdown & Spectre, joined our response communication platform to provide us with additional details about the vulnerabilities.

As the 180-day embargo they provided ended today, they just released a Meltdown proof of concept: https://github.com/IAIK/meltdown

On our side, we are still waiting for the hardware vendors' microcodes. We should receive them in the coming hours for the Scaleway platform.
We're tracking and testing the kernel IBRS patches and will start to upgrade the Scaleway platforms once we've fully qualified them.

Update #20 - 01/10/18 11:30am UTC

Following the lifting of the embargo, the official proof of concept for Spectre 2 from Google was disclosed: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272. After analysis, we believe the PoC only works on a specific software configuration. As far as we know, the Scaleway platform is not impacted, but our security team is still investigating & working hard to confirm this.

Meanwhile, we are performing benchmarks to identify the performance impact introduced by the Meltdown mitigation. We will release them later this week.

Important: the updated Ubuntu kernel that mitigates Meltdown is not working, according to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742323. We strongly recommend that you do not update your kernel to 4.4.0-108.

Important2 (01/10/18 12:20pm UTC): kernel 4.4.0-109 has been pushed and fixes the problem.
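For automated upgrades, a small guard can refuse the kernel release reported as broken above. This is a sketch under assumptions: `KNOWN_BAD`, `ubuntu_abi` and `is_known_bad` are illustrative names, and the parsing assumes Ubuntu-style release strings of the form `4.4.0-108-generic`.

```python
import re

# Ubuntu kernel reported as broken in this post (Launchpad bug 1742323).
KNOWN_BAD = {("4.4.0", 108)}


def ubuntu_abi(release):
    """Split '4.4.0-108-generic' into ('4.4.0', 108); None if not that shape."""
    match = re.match(r"(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_known_bad(release):
    """True when the release string matches a kernel listed in KNOWN_BAD."""
    return ubuntu_abi(release) in KNOWN_BAD
```

Calling `is_known_bad(platform.release())` on a running server, or on the candidate version before installing it, is enough to skip the affected build.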

We are actively investigating all possibilities to mitigate Spectre 2, mainly IBRS & Retpoline.

Update #21 - 01/11/18 3:30pm UTC

Dedibox

Important: We were notified by hardware vendors that Intel's microcode updates for PRO 2016 (DSS1510), Classic 2016 & LT2014D/LT2015D (R220) may cause instabilities. We've stopped the deployment of these microcodes and are waiting for new ones to mitigate the Spectre 2 vulnerability.

Web Hosting

Following the kernel upgrade we did a few days ago to mitigate Meltdown on the web hosting platform, our SRE team detected performance issues impacting the service. We've scheduled a maintenance window starting today at 4:00pm UTC to upgrade some of our servers to more powerful ones and solve the performance issues. This maintenance will last for about 30 minutes per site.

Update #22 (Web Hosting Important Performance Issue) - 01/11/18 6:00pm UTC

Web Hosting

We finished the first part of the maintenance and urgently upgraded some servers to more powerful ones. Our teams are carefully monitoring the performance impact.

Meanwhile, our customers faced major performance and stability issues (segfaults, OOM, etc.). Due to this situation, starting tomorrow, we will migrate all Web Hosting servers to more powerful ones, as we don't think a fix for the PTI performance regression will be available in the coming days.

During the maintenance, a downtime of 30 to 60 minutes per site is to be expected.

The following graphs from the past few days show the platform before and after the kernel upgrade (with & without PTI enabled). We clearly see the performance impact, bigger than expected, when upgrading to 4.14.11 with PTI enabled. The main workload on those servers is PHP.

Our teams are working hard to find a way to address the performance issue as fast as possible.

Update #23 - 01/17/18 6:00pm UTC

Last night, we received the microcode for both C2 baremetal servers and VC1 hypervisors. Our SRE team is currently testing and validating the new microcode to ensure everything is stable.

Meanwhile, the retpoline patches to mitigate Spectre 2 have been integrated into the 4.15 tree and back-ported to the 4.9.77 & 4.14.14 kernels. The 4.9.77 and 4.14.14 kernels are already available via bootscript.

The GCC retpoline patches have been back-ported into the GCC 7 branches.

We are waiting for retpoline patches to be released in GCC and will start recompiling kernels with the correct version of GCC as soon as it's available.
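To see what a running kernel itself reports for each vulnerability, recent Linux kernels expose one status file per issue under `/sys/devices/system/cpu/vulnerabilities`. The sketch below reads that directory when it exists and returns an empty result otherwise; older kernels do not provide it, so an empty result means "unknown", not "safe". The function names `read_vulnerabilities` and `unmitigated` are ours.

```python
import os

SYSFS_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_vulnerabilities(directory=SYSFS_DIR):
    """Map each file name (e.g. 'meltdown') to the kernel's one-line status."""
    if not os.path.isdir(directory):
        # Kernel too old to expose the interface, or not a Linux host.
        return {}
    report = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            with open(path) as handle:
                report[name] = handle.read().strip()
        except OSError:
            report[name] = "unreadable"
    return report


def unmitigated(report):
    """Names whose status line starts with 'Vulnerable'."""
    return [name for name, status in report.items()
            if status.startswith("Vulnerable")]


if __name__ == "__main__":
    for name, status in read_vulnerabilities().items():
        print("%s: %s" % (name, status))
```

On a kernel built with a retpoline-capable compiler, the `spectre_v2` entry is where the retpoline status would be expected to appear.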


To get a global overview per server family and operating system of the Spectre and Meltdown mitigations, check out our dedicated status page: https://www.scaleway.com/meltdown-spectre-status/.

Edouard Bonlieu

Strategy and marketing at Online.net & Scaleway