The Meltdown/Spectre saga: The impact across millions of cores

Authors: Stephen Kappel and John Matson

Published: January 29, 2018

On January 3, we experienced an unpleasant surprise across a cluster of large Redis instances in our infrastructure. For no apparent reason, the instances suddenly began using much more CPU, which forced us to scale out. We then realized that the spike might be tied to the hasty disclosure of the Meltdown and Spectre vulnerabilities and to the performance impact of the subsequent patches.

Figure: System CPU for a Redis cluster around the time of the Meltdown patch (red line), compared with the same metric one week earlier (dotted gray line).

While everything went roughly back to normal in our cluster a week or so later, we wanted to know how widespread the issue was, so we decided to look into all the cores that Datadog monitors. Read on to see the actual impact on millions of cores.

Averaged impact was small but measurable

Figure: System CPU usage on the average host.

Among the several CPU metrics collected by the Datadog Agent, the effect of the patches was most noticeable in system.cpu.system, the percentage of time the CPU spends in kernel space. This is not surprising: kernel page-table isolation, which mitigates the Meltdown vulnerability, introduces some overhead every time a program makes a system call.
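To make the definition of "time spent in kernel space" concrete, here is a minimal sketch of how such a percentage can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. It is an illustration under stated assumptions, not the Datadog Agent's implementation: the function names and the sample values are hypothetical, and the line is assumed to carry at least the eight standard counters.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat (cumulative ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of tick counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Trailing guest fields are ignored: they are already counted in user and nice.
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def system_cpu_percent(before, after):
    """Percentage of elapsed CPU time spent in kernel space between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return 0.0
    return 100.0 * delta["system"] / total


# Two hypothetical samples taken a few seconds apart (made-up numbers).
sample_1 = parse_cpu_line("cpu 1000 0 200 8000 50 0 10 0 0 0")
sample_2 = parse_cpu_line("cpu 1300 0 300 8600 50 0 10 0 0 0")
kernel_pct = system_cpu_percent(sample_1, sample_2)
print(f"system CPU: {kernel_pct:.1f}%")  # system CPU: 10.0%
```

In this made-up interval, 100 of the 1,000 elapsed ticks were spent in the `system` state, so the kernel-space share is 10 percent. Any extra work added to each system call shows up in that counter.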

Starting on January 3, when Meltdown and Spectre were disclosed ahead of the scheduled embargo, the average system.cpu.system across monitored hosts increased noticeably above baseline values. Nine days later, the metric’s value returned to normal, presumably as updates were rolled out to address performance issues with the initial patches.

Although the average impact on system.cpu.system was relatively small, accounting for an increase of less than 1 percent in total CPU utilization, the fact that the impact is clearly observable across so many cores, running dramatically varying workloads, shows how widespread the issue was.
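One simple way to express "above baseline" is a week-over-week difference, mirroring the one-week-earlier comparison used in the Redis chart. The sketch below assumes a series of daily averages keyed by date. The helper name and all values are hypothetical and are not taken from Datadog's data.

```python
from datetime import date, timedelta


def week_over_week_delta(series):
    """Map each day to (value - value seven days earlier), skipping days without a baseline."""
    deltas = {}
    for day, value in series.items():
        baseline_day = day - timedelta(days=7)
        if baseline_day in series:
            deltas[day] = value - series[baseline_day]
    return deltas


# Hypothetical daily averages of system CPU (percent); illustrative only.
start = date(2017, 12, 27)
values = [2.0, 2.1, 2.0, 1.9, 2.0, 2.1, 2.0,  # Dec 27 to Jan 2: baseline week
          2.6, 2.7, 2.6, 2.5]                 # Jan 3 to Jan 6: after disclosure
series = {start + timedelta(days=i): v for i, v in enumerate(values)}

deltas = week_over_week_delta(series)
for day in sorted(deltas):
    print(day.isoformat(), f"{deltas[day]:+.2f}")
```

Comparing each day with the same weekday one week earlier removes the weekly seasonality that most production workloads show, so that what remains is closer to the effect of the change itself.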

Greater impact for hosts with more system CPU usage

Figure: System CPU usage on hosts with high system.cpu.system.

We expect that kernel page-table isolation will disproportionately affect workloads that frequently call the kernel. And indeed, our data shows that CPUs that typically spend more time in kernel space saw higher-magnitude impacts. For hosts in the 99th percentile for system.cpu.system, the increase consumed an extra 4 to 5 percent of total CPU resources.
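The percentile cut described above can be sketched as follows. This is a minimal example, not the analysis actually run for this article: the host names, the synthetic values, and the assumption that overhead grows in proportion to kernel time are all hypothetical.

```python
import statistics


def top_percentile_increase(hosts, percentile=99):
    """Mean increase in system CPU for hosts at or above a baseline percentile.

    `hosts` maps a host name to (baseline_pct, patched_pct).
    Returns (cut_point, mean_increase_in_percentage_points).
    """
    baselines = [before for before, _ in hosts.values()]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cut = statistics.quantiles(baselines, n=100, method="inclusive")[percentile - 1]
    selected = [after - before for before, after in hosts.values() if before >= cut]
    return cut, statistics.mean(selected)


# Synthetic fleet: 200 hosts with baselines from 0.1% to 20.0% system CPU and an
# assumed 25% relative overhead after patching (made-up numbers).
hosts = {f"host-{i}": (i * 0.1, i * 0.1 * 1.25) for i in range(1, 201)}

cut, increase = top_percentile_increase(hosts)
print(f"p99 baseline: {cut:.2f}%, mean increase above it: {increase:.2f} points")
```

Selecting hosts by their baseline value, rather than by their post-patch value, avoids picking hosts simply because they were hit hardest.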

Impact by workload class

Figure: System CPU usage on different instance types.

Using cloud instance types as a proxy for the workloads being run on those instances, we can compare the impact on compute-intensive tasks against other workloads. The spike in system.cpu.system was most pronounced in compute-optimized and general-purpose virtual machines, with a less significant but still clearly detectable increase in memory-optimized instances. The elevated CPU levels across instance types, especially for compute-heavy workloads, speak to the systemic effects of the security patches.
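Grouping by instance type might look like the sketch below. The article does not say which providers or mapping rules were used, so the AWS-style family letters (`c`, `m`, `r`), the function names, and the sample measurements are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Assumed, deliberately incomplete mapping from AWS-style family letter to workload class.
FAMILY_CLASS = {
    "c": "compute-optimized",
    "m": "general-purpose",
    "r": "memory-optimized",
}


def workload_class(instance_type):
    """Classify an instance type such as 'c4.large' by its leading family letter."""
    return FAMILY_CLASS.get(instance_type[:1].lower(), "other")


def mean_increase_by_class(samples):
    """Average the system CPU increase per workload class.

    `samples` is an iterable of (instance_type, increase_in_percentage_points).
    """
    grouped = defaultdict(list)
    for instance_type, increase in samples:
        grouped[workload_class(instance_type)].append(increase)
    return {name: mean(values) for name, values in grouped.items()}


# Hypothetical measurements, not real data.
samples = [("c4.large", 1.0), ("c5.xlarge", 2.0), ("m4.large", 1.0), ("r4.large", 0.5)]
by_class = mean_increase_by_class(samples)
print(by_class)
```

Instance type is only a rough proxy: a compute-optimized machine can still run an I/O-heavy service, so per-class averages describe tendencies rather than individual hosts.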

Seeing the big picture

Of course, when we detected an impact on our own infrastructure via unusual CPU spikes in our Redis cluster, we were not alone. Many benchmarks, anecdotes, and reports have described significant performance hits in patched kernels and hypervisors, especially for compute-heavy workloads. The large-scale analysis from our unique vantage point adds a new dimension to these reports by confirming the widespread effects of the patches.

More broadly, the vulnerabilities appear to be the first of a new class of security issues with a profound performance impact. Whether others will follow remains to be seen.