
The Anatomy of a Cascading Failure


Published: July 17, 2019

Introduction

Rares: Thanks.

Hi, everyone.

My name is Rares.

I currently work as tech lead for N26.

For those of you who don’t know us yet, we’re a digital bank, which means that we don’t have any physical branches and all of our operations and all of the client interactions are happening through our app or through our customer support, so entirely online.

We’ve been active in the EU and the UK for a while, and we’re really excited to have launched in the U.S. last week, so we’re open for beta.

Yay.

I strongly encourage you to give us a try if you want a banking experience that’s completely different from what you’ve had: really nice user experience, no hidden fees, no tricks like maintenance cost or minimum balances, or so on.

It’s really good.

Also, we’re hiring in the New York office, so if you’re interested, please check out N26 careers, or talk to me, or to some of my colleagues here during the break.

What is a cascading failure?

So, yeah, enough about N26.

Today, we’re gonna talk about cascading failures: more specifically, how they happen and what they are.

What are some strategies and patterns to mitigate them, and what tooling could you use to actually implement these strategies in your backend microservice architectures?

When I was preparing for the talk, Datadog were nice enough to hook me up with a speaking coach and this guy told me, “Hey, start with a story. Something that would make for a really good headline.”

And I found it really hard to do because actually the whole talk is about how not to be in the headlines.

And in my defense, I didn’t know Cloudflare were sponsors for…

Yeah, I’m not trying to bash them, though; they’ve done a great job with releasing a postmortem for this.

It’s really detailed, it’s really professional, please have a look at it because it gives you a lot of hints into what not to do when running at that scale, so kudos to them.

Also to be fair to everyone, all the other examples are gonna be either from our staging or live environment.

So we’re gonna take a look at some of the times we fucked up.

Sorry, can I say that?

No.

Triggering conditions: what happened?

So let’s start dissecting what makes a cascading failure and let’s start with exhibit A, and I’m gonna ask you here, what do you think happened?

And this is a very vague question because the answer is also very vague.

So what we have here is the read/write operations that are going into an RDS.

RDS is just AWS speak for a relational database that’s managed.

In this case, it’s a Postgres instance and you see a huge spike in the number of write requests.

So what do you think happened?

Just throw something at me.

Sorry?

Man: [Inaudible 00:03:02]

Rares: Yeah.

So I told you the answer is gonna be quite generic and one of the main triggers that encourages or contributes to a cascading failure is change.

Change, as you know, is completely inevitable in our industry.

Change (and the speed of change) that you can actually have in your organization is directly related to the success of your organization.

So slowing down things is not an option.

It’s just a matter of how you account for change and treat it in a better way; that’s what will lead to your success.

And in this particular case, it was a roll-out of a piece of code that switched from reading some user information from the cache, which in this case was a Postgres instance, to reading it from the upstream, which was a different service.

What the person did was they reused the update method, which both read from the upstream and stored the information in the database.

Hence, the huge number of write operations.
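
As a concrete illustration, here is a minimal, hypothetical sketch of that difference; the names (UserProfileService, UpstreamClient, UserRepository) are made up for this example and are not N26’s actual code:

```java
// Hypothetical sketch: a read-only lookup versus a reused "update" method
// that also persists the result, turning every read into a database write.
record UserProfile(String userId, String displayName) {}

interface UpstreamClient { UserProfile fetchProfile(String userId); }

interface UserRepository { void save(UserProfile profile); }

class UserProfileService {
    private final UpstreamClient upstream;
    private final UserRepository repository; // the Postgres "cache"

    UserProfileService(UpstreamClient upstream, UserRepository repository) {
        this.upstream = upstream;
        this.repository = repository;
    }

    // The method that got reused: it reads from the upstream AND writes the
    // result back, so every read produced a write against the RDS instance.
    UserProfile refreshFromUpstream(String userId) {
        UserProfile profile = upstream.fetchProfile(userId);
        repository.save(profile); // <-- the unintended write on every call
        return profile;
    }

    // What the change actually needed: a read from the upstream only.
    UserProfile readFromUpstream(String userId) {
        return upstream.fetchProfile(userId);
    }
}
```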

Other types of change include, I don’t know, different planned changes, traffic drains from one data center to the other, etc.

Exhibit two, and this is just a zoomed-out version of the slide you’ve seen before.

So you see the release happening here and then you see a huge drop.

What do you think this was?

No, because releases would be highlighted with one of these pinkish lines.

Good shot though.

Man: [Inaudible 00:04:33].

Rares: No.

Man: [Inaudible 00:04:35].

Rares: Sorry?

Man: [Inaudible 00:04:36].

Rares: No.

Are you familiar with AWS Burst Balance?

Yeah.

So it’s this thing where AWS has built-in limits on how large a burst of requests you can send it at one time.

It allows you some burst balance to handle spikes, but if you are not being a good citizen for long enough, it will just throttle you.

So in this case, this was another potential trigger for a cascading failure.

They just like completely shut down all our write operations.

One more.

So what you have here: above, you have the number of logins coming into our system, and below, you have a service whose traffic should be more or less directly proportional to the number of logins.

But that’s not what’s happening, and you see all these spikes coming in from time to time.

What do you think happened?

Maybe check the time for a hint.

Man: Cron job.

Entropy

Rares: Yeah.

Oh, you have two points.

Congratulations.

Yeah.

So it is a cron job.

It’s happening every 15 minutes and I put this trigger condition under the general category of entropy.

Other examples of entropy could include DDoSs or instance death.

In this particular case, it’s just the extra traffic that’s added by a cron job, and what you see there with this huge spike is a very interesting kind of asynchronous DDoS, or that’s what I call it.

What this cron job did was it took some file from a partner, processed it, and moved on.

In this particular case, the file that came from our partner was huge.

So basically, one of our partners just sent us a huge file and this introduced a massive amount of traffic in our backend infrastructure.

So, yeah, watch out for that.

Fundamentally, all these triggers are working against some finite set of resources: CPU, memory, Vespene gas, what have you.

And this is one of these fundamental constraints that you have to work with every day.

And the problem is that these resources are also not independent from one another and here’s an example of how that happens.

So let’s assume that you have a frontend server and you start messing with the garbage collection.

So you start tuning the JVM to experiment with some garbage collection.

You’ll have poorly tuned garbage collection.

This will increase CPU usage because of the additional garbage collection.

Then you’ll have slow requests, which will lead to more requests being in flight at once, which will obviously lead to queuing, which will lead to less RAM for caching.

Remember, this is a frontend server, which means that you’ll have less caching, which means that you’ll have more requests going to the backend, which means that you have a higher potential of trouble.

So as you see resources are interdependent to each other, and when everything goes bad, you get a blue screen of death in the cloud.

An instance death can manifest itself in many different ways.

You can get a lot of 500s, timeouts.

Ultimately, these instances will be taken out of load balancing rotations and then you have the premise of a cascading failure.

And one of the main ways in which these cascading failures propagate is by redistributing load between instances.

So let’s just say that you have this example where you have an ELB that’s redirecting traffic more or less evenly across two instances and then one of them suffers from one of these blue screens of death.

So then the traffic on instance A doubles, which means that instance A is also more likely to suffer from one of these blue screens of death and you know this will propagate across your entire infrastructure.

This is a nice one actually.

So the 500s here trigger the traffic spike and not the other way around and it’s because we have retries that have gone haywire.

They’ve gone out of control.

They are not properly limited in some way.

So what we have here is like 500s trigger retries, which in turn trigger more 500s because of the extra traffic, which in turn trigger more retries and so on.

Latency creep

Another thing, latency creep.

So let’s say that you have service A calling service B and on the graph below, you see that the latency is increasing for this service.

And on top, you have the TCP connections that are established between these two.

They are communicating via an HTTP connection pool.

They’re using HTTP connection pools to call each other, so the underlying TCP connections are reused, but unfortunately, because you have extra lag and some of these connections are actually waiting on I/O, you have to establish new connections, so the graph above is symptomatic of what happens.
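
On the client side, one way to keep this from spiraling is to put a hard cap on the connection pool and time out quickly when no pooled connection is free. Here is a minimal sketch assuming Apache HttpClient 4.x (the talk doesn’t name a specific library, and the numbers are placeholders):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public final class PooledClientFactory {

    // Bound the pool and bound how long callers wait for a connection from it,
    // so extra lag upstream can't silently grow into unbounded TCP connections.
    public static CloseableHttpClient create() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(50);            // hard cap on connections overall
        pool.setDefaultMaxPerRoute(20);  // cap per downstream host

        RequestConfig timeouts = RequestConfig.custom()
                .setConnectTimeout(500)           // ms to establish a connection
                .setSocketTimeout(800)            // ms of inactivity on the socket
                .setConnectionRequestTimeout(200) // ms to wait for a pooled connection
                .build();

        return HttpClients.custom()
                .setConnectionManager(pool)
                .setDefaultRequestConfig(timeouts)
                .build();
    }
}
```

With a bounded pool, extra lag shows up as fast, visible failures (timeouts waiting for a connection) instead of an ever-growing pile of open connections.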

Now, let’s assume that this happens in this setup.

So you have traffic going from service A to service D to service E, fast, no problem, but you have this slowness happening between B and C, which means that service A is also gonna be slow calling service B and it’s also gonna be more likely to retry.

And this will slowly propagate to your clients as well, which means that the client is more likely to retry the call to service A, which will also call service D and service E as part of the retry.

So you’re introducing additional traffic in your infrastructure and this will make it more likely for your instances to run out of resources and again, run into a cascading failure.

Last, but not least, resource contention during recovery.

So let’s just say that you have some…in this particular case, this was a load testing or staging environment.

So we had everything maxed out.

CPU was super high.

So we had an auto-scaler that kicked in, but it kicked in really slowly.

So what happened is that after 10 minutes it spawned a new instance and this instance took a while to spawn and after that, it started receiving traffic.

But it started receiving traffic really fast and after that it died immediately.

So if you don’t scale up fast enough in one of these situations, you’ll just bring these tiny resources to the slaughter and they’re not gonna help in any way.

The same goes with when you’re trying to throttle or reduce traffic going into instances that are overloaded.

It’s exactly the same thing.

So what about ways to improve this?

Water break, sorry.

And one of the first things you can do is actually reduce the number of synchronous service-to-service calls that you have in your infrastructure.

Basically, this is switching from a pattern called orchestration to a pattern called choreography.

You might be familiar with choreography because it’s what all the hip CQRS and event-sourcing people are trying to push on you.

It’s not the panacea, it’s not the silver bullet, but it does help in many situations.

There’s significant overhead and you should be really careful when you choose one over the other, but this thing will really help reduce, for example, the likelihood that if the user service in this case is down, there will be problems affecting other services.

The events will just queue in your message bus (Kinesis, Kafka, whatever) and the user service will be able to process them when it comes back up.
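
For example, here is a minimal sketch of the choreography side, assuming Kafka as the message bus; the topic name and broker address are hypothetical:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SignupEventPublisher {
    private final KafkaProducer<String, String> producer;

    public SignupEventPublisher() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Instead of calling the user service synchronously, emit an event.
    // If the user service is down or underprovisioned, the event just waits
    // in the topic until the service catches up.
    public void publishUserSignedUp(String userId, String payloadJson) {
        producer.send(new ProducerRecord<>("user-signed-up", userId, payloadJson));
    }
}
```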

Also, if the user service temporarily has too little resources, you have a bit more time to add and to auto-scale it, which brings me to my next question.

Do you need capacity planning?

How many of you are doing capacity planning as a formal process in your companies?

Okay.

So you could argue that you need capacity planning, right?

Because you need to make sure that you have enough resources to process the entire traffic that you have.

And it’s true, but I would argue that not everybody needs a formal process around it.

So if you’re Google, yeah, probably if you handle hundreds of thousands of requests every second, you need to do a formal process before you launch something into production.

Unfortunately, this process is gonna be very costly and it’s very likely to be imprecise.

So I wouldn’t invest too much time into it especially if you’re operating in the cloud and there are other things that you can do that we’re gonna discuss in the next slide.

A special case in this situation is people who are deploying on-prem, where, again, you kind of do need some capacity planning because your IT probably has to order servers and such.

So I would argue that these are some things that are probably a better use of your time than doing advanced capacity planning and probably the most important one is to automate provisioning and deployment.

You know, the “cattle not pets” paradigm as well: make it very cheap for people to either turn on or off instances, spawn new instances, bring them down, make it very cheap, make it very casual, and build infrastructure around that.

Also, together with this, make it automatic as much as possible.

Make sure that your instances can react to increases in traffic and auto-scale either up or down, or also try to auto-heal.

Here, again, I’m coming back to publisher subscriber.

It’s another place where capacity planning is not needed or is not as needed because publisher subscriber is better at handling situations where you don’t have capacity on one of your instances.

And really, really important: make sure you’re optimizing for the right thing and also make sure that you have SLIs and SLOs that are built around this optimizing for the right thing.

Ask your business people what is the most important thing for your business, monitor it closely, set SLOs for this, and make sure that if these SLIs are measuring some degradation and you’re getting closer to violating the SLOs, you take action immediately or as soon as possible.

Chaos testing

Another one that we’ve started experimenting with is chaos testing.

This is a practice of introducing problems into your live environment intentionally in order to see how your resilience features are performing.

And it’s very important to isolate these experiments in such a way that you can also recover from them and you won’t break down everything.

It’s still something that we’re learning on.

There’s lots of really amazing resources online and it’s something that’s really catching on in the industry, so I totally advise you to give it a try as well.

Retrying

Retrying.

This one is something that many people do, a lot of people do badly, and we were some of these people in the past.

First of all, let’s discuss what makes a request retriable, and I cannot emphasize enough how important idempotency is in this.

If your endpoints are not idempotent, your clients cannot know for sure if they can or cannot retry because if they’re not idempotent, a retry might result in a double-write to their database.

If you’re dealing with money like we are, this would be something that we cannot afford to risk, so please, please, please have idempotency implemented in your endpoints.

Also don’t do clever things.

Don’t do stuff like GET with side effects.

Respect the semantics of HTTP verbs and don’t violate the expectations that your clients might have when it comes to these semantics.

The last one is kind of weird.

If you can be stateless, please be stateless.

I understand that there might be situations where you cannot be stateless, but if you can (and I think that you really should challenge yourself to be stateless all the time), then please do so, because it makes it easier for you to implement things like retries.

Timeouts are a bit of a more complicated topic that we’re gonna look at in the next slides.

Another thing that’s often overlooked when retrying is jitter.

So many people do exponential back-offs when they retry, but if you look to my left here, you will see how these exponential back-offs tend to align into clusters of bursty traffic.

Which means that during these clusters of bursty traffic, you are more likely to overload your service temporarily, which makes it harder for the service to recover.

If you had jitter, where jitter is just some randomness in the time between retries, you’ll have this more even distribution of requests that are going to your failing microservice.

This is super-important and something that will help you heal your service better or more quickly.
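
A minimal sketch of that idea, with arbitrary base and cap values, is exponential backoff with "full jitter": the actual sleep is drawn at random from the whole backoff window, so retries from many clients don’t line up into synchronized bursts.

```java
import java.util.concurrent.ThreadLocalRandom;

public final class RetryBackoff {
    private static final long BASE_DELAY_MS = 100;   // placeholder values
    private static final long MAX_DELAY_MS = 10_000;

    // Exponential backoff with full jitter: instead of every client sleeping
    // exactly 100ms, 200ms, 400ms... (and retrying in lockstep), each client
    // sleeps a random amount between 0 and the exponential cap.
    public static long delayMillisForAttempt(int attempt) {
        long cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 10));
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }
}
```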

Finally, make sure you have some sort of retry budget, and everybody’s familiar with the simplest form of a retry budget: retry at most three times.

But I’d propose that you do something else, which is to look at your instance and ask: how many of the requests that I’m sending to one particular upstream are retries?

If more than 10% of the requests are retries, it’s quite likely that this service is throwing a lot of 500s and retrying some more will not help it heal.

So try to throttle sending additional retries to this upstream.
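
A sketch of that ratio-based budget might look like the following; it’s deliberately simplified (a production version would track a sliding window rather than lifetime counters):

```java
import java.util.concurrent.atomic.LongAdder;

public class RetryBudget {
    private static final double MAX_RETRY_RATIO = 0.10; // at most 10% of traffic may be retries

    private final LongAdder totalRequests = new LongAdder();
    private final LongAdder retries = new LongAdder();

    // Call this for every request sent to the upstream, first tries included.
    public void recordRequest() {
        totalRequests.increment();
    }

    // Ask before sending a retry: if retries already make up more than 10%
    // of the traffic to this upstream, drop the retry instead of piling on.
    public boolean tryAcquireRetry() {
        long total = totalRequests.sum();
        if (total == 0 || retries.sum() + 1 > total * MAX_RETRY_RATIO) {
            return false;
        }
        retries.increment();
        recordRequest(); // the retry is itself another request to the upstream
        return true;
    }
}
```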

Throttling and setting timeouts

Now, we get to timeouts.

So what’s wrong with this?

If you look at the timeouts from service A to service B and from service B to service C, they decrease progressively, but the one service C sets for calling service D is higher than those.

Which means that if service D is slower to respond to service C’s request, it means that service B will time out way faster and it will retry.

And those tiny people with the hard hats represent work that’s actually unnecessary because service B has already timed out.

So this thing will propagate backwards into your service A.

So service A will also likely retry at some point, and service B will not know that it’s retrying, so it will continue to do unnecessary work, which will end up overloading your backend infrastructure unnecessarily.

So try to be disciplined when it comes to setting timeouts and try to time out as early as possible.

Don’t leave timeouts that are too long because they really don’t work as timeouts after that.

And the problem with being disciplined is that it’s very hard.

So let’s say that you were disciplined in the beginning because you saw this talk and after that you were like, okay, we’re gonna revamp the way we do timeouts.

You were disciplined, but after that, new people came to your company because it became super successful, and you have this other flow with a completely different timeout-setting policy, so this kind of brings everything back to the previous state again.

So what I would advise you is to propagate timeouts and there are some frameworks that would help you with this.

I think gRPC is one of the frameworks that actually helps you propagate timeouts between calls.

If not, you can just build your own solution for that.
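
If you do roll your own, a minimal sketch could look like this: the caller passes an absolute deadline along (the header name here is made up), and every hop derives its own timeout from whatever budget is left instead of using an independently configured value.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

public final class DeadlinePropagation {
    // Hypothetical header carrying the absolute deadline as epoch milliseconds.
    public static final String DEADLINE_HEADER = "x-request-deadline-ms";

    // On the way in: use the caller's deadline if present, otherwise start a fresh budget.
    public static Instant deadlineFrom(Optional<String> headerValue, Duration defaultBudget) {
        return headerValue
                .map(value -> Instant.ofEpochMilli(Long.parseLong(value)))
                .orElseGet(() -> Instant.now().plus(defaultBudget));
    }

    // On the way out: whatever time is left becomes the timeout for the downstream call.
    public static Duration remainingBudget(Instant deadline) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            // Fail fast instead of doing work nobody is waiting for anymore.
            throw new IllegalStateException("Deadline already exceeded");
        }
        return remaining;
    }
}
```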

Otherwise, also try to avoid nesting as much as possible, and again, publisher/subscriber helps a bit because you have fewer services that are depending on each other in these, like, very complex call stacks.

For the love of God, don’t do this please.

Not only for the timeout reasons; it’s very hard to reason about the behavior of this particular setup if service C, for example, goes down or is slow, so please, please, please try to avoid it as much as possible.

Another thing you could do is rate limiting, and this is where we’re going from client-side throttling to server-side throttling.

Me, as a server, I want to protect myself from abuse from other callers.

So you can assign something like client budgets, which, in their simplest form, say that this particular client can actually call me X times during a one-second or some other time interval.

The problem with that is that requests are not equal. You might have some health check that’s a very cheap call. You might have some other request that’s very computationally heavy.

So that’s why you should calculate per client limits in terms of some sort of hardware resource. Try to translate the actual calls that clients make into some sort of CPU or memory or some other hardware resource, so that you have an even playing field when setting these budgets.
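
A minimal sketch of such a budget, expressed in abstract "cost units" as a stand-in for CPU or memory (all the numbers and names here are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Per-client budget expressed in cost units rather than raw request counts,
// so a cheap health check and an expensive report don't count the same.
public class ClientBudget {
    private final double capacity;       // max cost units the client may accumulate
    private final double refillPerNano;  // cost units restored per nanosecond
    private double available;
    private long lastRefillNanos;

    public ClientBudget(double costUnitsPerSecond) {
        this.capacity = costUnitsPerSecond;
        this.refillPerNano = costUnitsPerSecond / TimeUnit.SECONDS.toNanos(1);
        this.available = costUnitsPerSecond;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryConsume(double requestCost) {
        long now = System.nanoTime();
        available = Math.min(capacity, available + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (requestCost > available) {
            return false; // over budget: reject or throttle this client
        }
        available -= requestCost;
        return true;
    }
}
```

A health check might then call tryConsume(0.1) while a heavy report endpoint calls tryConsume(25), so both are charged against the same per-client budget but at very different rates.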

Throttling: circuit breaking

A very popular strategy is to use circuit-breaking.

I’m not gonna go into too much detail around this, but you have this thing in between the call from service A to service B.

If the circuit-breaker detects some sort of degradation, either 500s or timeouts, it will open itself once these errors go beyond a specific threshold, and it will not let any requests go through.

Except that, after a while, it lets some of them through as probes, and if the probes turn out to be successful, it starts letting traffic go through again.
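
As a minimal sketch, using Resilience4j (one of the libraries that comes up again in a moment) with placeholder thresholds, wrapping a call in a circuit breaker looks roughly like this:

```java
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class ServiceBCaller {
    private final CircuitBreaker breaker = CircuitBreaker.of("service-b",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                        // open when >50% of recent calls fail
                    .slidingWindowSize(100)                          // ...measured over the last 100 calls
                    .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open before probing again
                    .permittedNumberOfCallsInHalfOpenState(5)        // the "probe" requests
                    .build());

    // When the breaker is open, this throws CallNotPermittedException immediately
    // instead of sending yet more traffic to a struggling upstream.
    public String callServiceB(Supplier<String> httpCall) {
        return breaker.executeSupplier(httpCall);
    }
}
```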

Adaptive concurrency limits

As you probably know, circuit-breakers were heavily popularized by Hystrix, which is a library from Netflix.

Unfortunately, Hystrix was decommissioned early October last year, I think, and Netflix came up with something else.

And they called it “adaptive concurrency limits,” and I’m gonna try to explain it to you here.

This is an algorithm that’s based on TCP congestion control and the basic assumption is that when queuing forms, you will get higher latency.

So you have concurrency, which is the number of requests that your service can actually process in parallel, and queues, which are all the other requests that are coming in—and they cannot be processed in this concurrent manner.

So when you have queuing, you have lag, and what they did is they developed this formula (a really simple one), where they calculate the gradient, which is the ratio between the lag without any load and the actual lag measured at this point in time.

And this is multiplied by the current limit and to it, you will add a queue size to account for bursty traffic.

And what this means is that if the gradient is smaller than one, which it would be if the current lag is higher than the no load lag, it will just decrease the limit.

And eventually, the see-saw pattern will end up being very close to the actual limit that you have for your service-to-service calls.
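
In code, the core of that update rule is just a few lines. This is a hedged sketch of the idea, not Netflix’s actual concurrency-limits library (which additionally smooths samples and periodically probes for a fresh no-load RTT):

```java
public class GradientConcurrencyLimit {
    private double noLoadRttMs = 0;               // best round-trip time seen so far (proxy for the "no load" lag)
    private double limit = 20;                    // current concurrency limit
    private static final double QUEUE_SIZE = 4;   // headroom for bursty traffic
    private static final double MAX_LIMIT = 200;

    public void onSample(double measuredRttMs) {
        noLoadRttMs = (noLoadRttMs == 0) ? measuredRttMs : Math.min(noLoadRttMs, measuredRttMs);

        // gradient = noLoadRtt / measuredRtt: drops below 1 as soon as queuing adds latency.
        double gradient = noLoadRttMs / measuredRttMs;

        // newLimit = currentLimit * gradient + queueSize, clamped to sane bounds.
        limit = Math.max(1, Math.min(MAX_LIMIT, limit * gradient + QUEUE_SIZE));
    }

    public int currentLimit() {
        return (int) limit;
    }
}
```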

Fallbacks and rejection

If everything else fails, what do you do?

And I think Ross had a really cool example of this, which I also use in my talk.

One of the best things to do as far as the user is concerned is to just return the results from a cache.

Many times, if this is not a value that updates quite often, returning from a cache is very cheap and very nice from the user’s perspective.

At the same time, it’s often very complex to implement and it can be costly.

Same for writes: if you have some writes, you can just park them in a dead-letter queue until your service becomes healthy again and process them later.

Another thing you could do is return a hardcoded value.

And again, I’m gonna go back to Ross’s example, where, at the BBC, they have a recommendation engine.

And for example, if their recommendation engine is down, they could just return something like a very generic set of recommendations that everybody likes like the Avengers or something.

If everything else fails, you can return an empty response or an error.

Make sure that this error is as descriptive as possible and that it points the user to some sort of resolution like to your customer support phone number or something like this.
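
Putting the fallback order together, here is a hypothetical sketch using the recommendations example (the interfaces and names are made up for illustration): try the live call, then the cache, then a hardcoded default.

```java
import java.util.List;
import java.util.Optional;

interface RecommendationEngine { List<String> recommend(String userId); }

interface RecommendationCache { Optional<List<String>> lastKnown(String userId); }

class RecommendationsWithFallback {
    private static final List<String> GENERIC_PICKS = List.of("Avengers", "Planet Earth");

    private final RecommendationEngine engine;
    private final RecommendationCache cache;

    RecommendationsWithFallback(RecommendationEngine engine, RecommendationCache cache) {
        this.engine = engine;
        this.cache = cache;
    }

    List<String> recommendationsFor(String userId) {
        try {
            return engine.recommend(userId);      // happy path: the live engine
        } catch (Exception e) {
            return cache.lastKnown(userId)        // fallback 1: possibly stale cache
                    .orElse(GENERIC_PICKS);       // fallback 2: hardcoded value everybody likes
        }
    }
}
```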

Also, it’s very important to discuss the tradeoffs of how expensive it is to implement a cache versus the actual benefits you would get in case of a failure, and to discuss these with your product owners.

Sorry.

Choosing the right tools

Last but not least, let’s look at some tooling that you could use for implementing these things in your backend.

And this is kind of the landscape.

It’s a bit biased towards the JVM, but these are kind of the two main categories of tooling that you could use to implement this.

One is libraries and frameworks like Hystrix, like this adaptive concurrency limit from Netflix, like Sentinel, Resilience4j and so on.

In this particular case, these are JARs that you add to your service, and in your code, you wrap your external calls in some sort of data structures that are provided by these libraries.

The hype train is moving towards something else, which is side-car proxies.

Everybody’s implementing service meshes, and proxies like Envoy, for example, have excellent support for resilience features, the resilience features that I mentioned earlier.

They can do circuit-breaking.

They’re even gonna do adaptive concurrency limits soon. Lyft is working together with Netflix to bring this feature into Envoy.

So it’s really exciting stuff.

At the same time, it’s very complex stuff.

So I would argue that you shouldn’t blindly follow the hype train here, and you should really look at your use case and at the scale at which you’re operating in order to decide which tools fit better for you.

So your knobs may vary on this, massively.

Your knobs may vary

To be a bit more specific, when it comes to libraries and frameworks, they’re easy to operate, right? It’s just code that lives inside of your component or inside of your service.

At the same time, they’re very hard to enforce, which is where side-car proxies shine.

For example, in our case, I was guilty of building some sort of SDK just because we were not implementing Hystrix in a proper way across our services.

So I was like, okay, let’s just build SDKs for all the services, so we ended up with these fat clients that had lots of transitive dependencies that were hell to integrate with.

Because, whenever you added them as dependencies, then they would conflict with other dependencies in your services and it was just hell.

Please don’t do it.

It’s a really bad idea.

Envoy

On the other hand, Envoy proxies are just something that you can apply uniformly across your entire infrastructure.

So it’s really good and it’s really easy to enforce, so that you can know for sure that people are not forgetting to add resilience features to their service calls.

And that you can implement these resilience features in a consistent way.

Libraries are simple to test.

Side-car proxies are not.

You have to spawn them in some sort of Docker container.

After that, you run some integration tests against them, whereas with something like Hystrix, you can just write unit tests, and that’s simple to test.

On the other hand, side-car proxies are polyglot.

So if you have a very heterogeneous environment in terms of languages, then you would benefit from implementing something like a service mesh, because you could implement this irrespective of whether the client is a Node.js service or, I don’t know, Spring Boot or whatever.

Finally, libraries are relatively simple to configure and most of the configuration is just done in code.

Side-car proxies are hard to configure; that’s why there’s so much hype around fancy control planes like Istio: a lot of the pains of configuration are actually mitigated by these control planes.

At the same time, if you look at a library like adaptive concurrency limits, it’s very hard to predict when you’re gonna be throttled and how you’re gonna be throttled.

Whereas, with something like Envoy, it’s really easy to predict this, or at least it’s very easy to visualize it because it has excellent observability and monitoring.

And this brings me to the most important point when making this choice, which is observability.

Please, make sure that you’re always in control of what happens and you always see what’s the status of your backend infrastructure.

Which circuits are open, and so on, so that you make it easy for your backend engineers to debug live production issues.

And if you don’t believe me, maybe you’re gonna believe Matt Klein.

This person is actually the daddy of Envoy and he’s advising you to do the exact same thing, so it’s actually me quoting Matt Klein, not the other way around.

But he is pushing you to really consider your use case before going for something that’s very heavy, like a service mesh or something like that.

And with that, I’m happy to take any questions that you might have.