Learning from AWS failure

Published: October 23, 2013

Of all days it had to happen on Friday the 13th. Last month, AWS customers experienced a partial network outage that severed part of an availability zone’s connectivity to the rest of the internet. A month before that, there was another gray failure with a very black-and-white impact on a dozen well-known internet services.

The point is not to lament failures of public clouds. Everyone, whether on-premises, in the cloud, or anywhere in between, experiences failures. Failures are a fact of life. AWS’s just get more publicity. Instead let’s focus on the more interesting question: What can we learn from these failures?

AWS failure incident #1: EBS

At Datadog HQ we saw the announcement pop up that Sunday at 4pm EDT on one of our AWS-centric screenboards. If you’ve turned on the AWS integration and have instances in us-east-1, you saw it too.

The initial AWS failure seen from Datadog

Our key metrics were not affected (thanks to our past experience) so we could stay in proactive mode and formulate some hypotheses on what the impact would be, should the outage spread further. For that we relied heavily on our own metric explorer (yes, Datadog eats its own dog food).

Later that day, we received more details about the outage in the event stream (emphasis mine).

AWS failure update seen from Datadog
[...] a small number of EBS volumes in that AZ to experience degraded performance, and a small number of EC2 instances to become unreachable due to packet loss in a single AZ in the US-EAST-1 Region. The root cause was a 'grey' partial failure with a networking device that caused a portion of the AZ to experience packet loss. [...] The networking device was removed from service and we are performing a forensic investigation to understand how it failed [...]

AWS failure incident #2: network outage in an EC2 zone

Here is the visualization of the impact of the network outage per zone, in terms of latency and throughput.

AWS Network outage visualized

The first indicator of an issue is latency, shown as (1) on the top graph. Throughput in that zone degrades, but only to a point. As latency increases, throughput continues to drop until all traffic has been drained from the zone and other zones pick up the slack (3). At that point latency recovers, as no traffic is flowing to the affected zone. Eventually the problem is fixed (4) and everything returns to normal.

And here is the corresponding status message from AWS (emphasis mine):

Between 7:04 AM and 8:54 AM we experienced network connectivity issues affecting a portion of the instances in a single Availability Zone in the US-EAST-1 Region. Impacted instances were unreachable via public IP addresses, but were able to communicate to other instances in the same Availability Zone using private IP addresses. Impacted instances may have also had difficulty reaching other AWS services. We identified the root cause of the issue and full network connectivity had been restored. We are continuing to work to resolve elevated EC2 API latencies.

Lesson #1: “a grey partial failure with a networking device” can cause serious damage

The “grey” partial failure of just one networking device is likely among the most difficult to diagnose. The device or service is seemingly working, but its output contains more errors than usual.

How can you tell if you are suffering from a partial failure? There is no magic bullet, but one key metric to monitor is the distribution of errors, with a particular focus on the outliers. In this case, a high latency (greater than 10s) can be construed as an error in the service. The maximum, 95th percentile, and median of error metrics are good starting points.

This applies to network metrics, but also to any application-level metric. For instance, tracking the error rate or the latency of the critical business actions in your application will alert you to a potential “grey” failure: when the service seemingly works when you try it, but it does not for more and more of your customers.
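To make this concrete, here is a minimal sketch of summarizing a latency distribution this way. The metric values, the function name, and the 10-second error threshold are all illustrative, not taken from any particular tool:

```python
# Sketch: summarize a latency distribution to surface a "grey" failure.
# The 10s threshold for "this is an error" is an illustrative choice.
import statistics

def latency_summary(samples_ms, error_threshold_ms=10_000):
    """Return max, 95th percentile, and median latency, plus the share
    of requests slow enough to count as errors."""
    ordered = sorted(samples_ms)
    # quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(ordered, n=20)[18]
    return {
        "max": ordered[-1],
        "p95": p95,
        "median": statistics.median(ordered),
        "error_rate": sum(s >= error_threshold_ms for s in ordered) / len(ordered),
    }

# A mostly healthy distribution with a couple of pathological outliers:
# the median and p95 barely move, while max and error rate give it away.
samples = [120, 130, 110, 140, 125] * 20 + [12_000, 15_000]
summary = latency_summary(samples)
```

With these samples the median stays at 125 ms and the p95 at 140 ms, while the maximum jumps to 15 s: exactly the signature of a small but growing tail of failing requests that averages would hide.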

If you use StatsD’s histograms, you get these for free: here’s how you can get started.
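For illustration, here is a bare-bones sketch of emitting a timing metric over the StatsD wire protocol; in practice you would use a client library, and the metric name `app.checkout.latency` is made up:

```python
# Hedged sketch: emitting a StatsD timer by hand over UDP.
import socket

def send_timing(name, ms, host="127.0.0.1", port=8125):
    # Timers use the "<name>:<value>|ms" line format; the StatsD server
    # aggregates them into max, percentiles, and median on each flush.
    payload = f"{name}:{ms:g}|ms".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# e.g. send_timing("app.checkout.latency", 241)
```

Because the transport is fire-and-forget UDP, instrumenting a hot code path this way adds negligible overhead even when the StatsD server is down.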

Lesson #2: the API works until it does not

This is a common second-order pattern with shared infrastructure: an issue, whether experienced firsthand or merely heard about, causes many people to hit the API to run away from the disaster scene, sometimes so much so that the API itself gets overwhelmed. We can’t say for certain that this is what happened here, but it is risky to assume that the cloud will always have extra capacity.

In our experience, any instance or volume in an affected zone should be considered lost: you will save time by focusing on the healthy part of your infrastructure rather than trying to patch up the broken pieces.

Because you can’t count on the API working when you need it most, your contingency plan should rely on infrastructure already deployed in another zone or region.
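The write-off approach can be sketched in a few lines; the instance records and zone name below are illustrative, not pulled from any real API:

```python
# Sketch: treat everything in the affected zone as lost and route traffic
# only to capacity that is already running elsewhere.
AFFECTED_ZONE = "us-east-1a"  # hypothetical degraded zone

instances = [
    {"id": "i-01", "zone": "us-east-1a"},
    {"id": "i-02", "zone": "us-east-1b"},
    {"id": "i-03", "zone": "us-east-1c"},
]

# No API calls, no rescue attempts: just stop sending work to the zone.
healthy = [i for i in instances if i["zone"] != AFFECTED_ZONE]
```

The point of the design is that failover depends only on local state and pre-provisioned capacity, never on the provider’s API being responsive mid-incident.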

It goes without saying, but we’ll say it nonetheless: build for failure, and don’t get attached to your infrastructure. It has a shelf life. The tools are now here to let you treat your infrastructure as disposable when you can’t rely on it.



Want to write articles like this one? Our team is hiring!