How DraftKings Solves the Microservices Murder Mystery With Circuit Breakers

Published: July 12, 2018

An introduction to DraftKings

Travis:

So, our mission is to bring our fans closer to the games they love.

So, what that means is that we’re a daily…oh, I forgot about this, this is a late insertion.

We saw this on the streets of New York today.

So, Drape Kings has apparently heard of us.

I thought that was pretty neat.

So, this was a late insertion into the slide.

So, as Jeremy said, I am Travis Dunn, Chief Technology Officer at DraftKings.

I’ve been there about four-and-a-half years.

I was roughly employee 30, and now we’re over 500.

So, growing very rapidly over the last five years.

For those that aren’t familiar with DraftKings, we are a sports entertainment brand and the largest provider of daily fantasy sports.

So, if you don’t know what daily fantasy sports is, it’s like season-long fantasy sports, but it’s condensed into as little as a single game or maybe even a week long.

So, there’s a salary cap that you use, like $50,000 virtual and you get to spend those virtual dollars on athletes.

So, in this particular case, Philip Rivers costs you $6,300 of your virtual salary.

That’s kind of a bargain.

So, basically, you create these fantasy teams and then you compete for real money prizes on the website with friends and/or with the public.

So, just to give you a sense of our scale, we have 10 million registered users, which is a dramatic increase from when I started.

We handle up to a million API requests per minute.

We handle 22,000 entries per minute.

I’m gonna stop for a second there.

So, an entry is essentially when you take your lineup and you put it in a contest.

That is essentially an ecommerce transaction.

So, we’re holding your money and you’re buying the tickets.

So, you’re talking about 22,000 financial transactions a minute, and this is all powered by 70-plus microservices.

So, some subjective opinions here.

So, a little bit of a warning.

DraftKings is a .NET shop.

We are on Windows.

This is not the most popular decision, but we actually are big Windows fans.

And more importantly, we’re big C# fans.

And if you haven’t used Visual Studio, and you’re messing around with some crappy IDE, try it.

It’s actually pretty great.

Other technologies that we use, Amazon Aurora for our databases, which has been really exceptional for us.

Pretty standard client libraries with Objective-C for iOS, Java for Android app, React on the front end for our web.

We’re big believers in Datadog and we’ve actually been clients of Datadog since 2014.

All right, so let’s set the stage here a little bit.

We’re gonna follow our journey through microservices, but instead of talking about the microservice re-architecture, we’re gonna look at it through the simple lens of how we evolved a library.

This is also the story of outages, which is always a painful thing, but what you’ll see is one case at a time of where we experienced an outage and how we evolved our architecture through the lens of a single library.

All right.

I’m gonna let everybody read this tweet for a second.

How many of you are on microservice architectures?

How many people just died a little bit inside because of the truth of the statement?

Yeah, the complexity of a microservices architecture is not to be underestimated.

You really have to go in eyes wide open.

The advantage of a monolith is actually you know when it dies because there’s only one thing that died.

Case 1: Trial by query

Speaking of monoliths, we’ll start with case number one, which I call it “Trial by Query.”

This is the outage that started essentially our scale efforts for DraftKings.

Ironically, it is the thing that allowed us to grow and succeed.

So, some context, this is our architecture in early 2014.

You saw the title of the case.

Anybody spot a single point of failure in this system?

Yeah, that one.

So, in 2014, things were going very well.

We were growing like mad.

We just launched the first fantasy sports contest that awarded a million dollar top prize.

We were feeling really good about ourselves.

We launched a free contest that had an unlimited number of entries, unlimited number of entries, I’ll pause that for a second, that awarded a top prize of $100,000.

We were missing a very important feature back then, a “view more entrants” button on the mobile app.

So, this is not the core website, this is the mobile website, not even the mobile app.

So, only a fraction of our traffic was missing a “view more entrants” button.

So guess what happened?

That $100,000 contest was really, really popular and a lot of people entered it…and the database died, and it died hard.

So, at exactly 1:00 p.m. the contest went live.

People started to see who they were gonna compete against and everything crumbled.

So, what we learned that day is, A: a database as a single point of failure, not great.

B: if you’re ever having the debate of stored procedures versus inline SQL, stored procedures are better.

I know it’s an opinion, but it means you can fix these things very quickly.

And it also meant that we didn’t wanna live through this again.

Another casualty of that day was my phone.

I was on call in the office and it did not survive.

A little bit of a detour, I was also supposed to go out for dim sum that day with a colleague of mine.

We never made it.

I’ve never had dumplings for breakfast ever again.

So, there comes a moment with every growing company that is experiencing problems.

You wonder, what would Netflix do?

They’re a lot bigger than us, they’re a lot more open about their architecture, and honestly, I was a little bit envious, thus the green.

So, when we started looking around, we discovered a library called Hystrix.

So, Hystrix is a circuit breaker technology and it looked exceptionally cool.

We really wanted to use it, but did I mention we run C# and .NET?

Well, this is Java.

So, what do we do?

Creating Ground Fault

We built our own circuit breaker library that we called “Ground Fault” after the safety bathroom outlet there.

And what does the circuit breaker do?

Essentially, it looks at error rates, and if a certain error rate has been exceeded, it will shut off any additional requests to that system.

So, you can imagine if we had this in place with the trial by query death of the database, it would have determined that this is slow, it’s timed out, and would have stopped access to that query.

So, that would have saved us from going down that day.

So, essentially, a request, you get an error, you update your metrics, a request, an error, you update your metrics, the circuit breaker opens, no more.

So, there’s some key properties that make our circuit breaker work.

Some of them are pretty self-explanatory.

Resource name, which is the logical resource.

You might say the database or this web service.

The command name, which might be “get more entrants for your contest.”

The timeout, so how long to wait before erroring out.

The failure threshold percentage.

So, this is like, you can set it at like 80% of requests fail, you then trip the circuit breaker, no more traffic.

Minimum volume per second, we learned this the hard way.

Essentially, if you have a very low-volume service that has an error, that’s 100% failure rate and it trips.

You don’t like that very much.

And then open sleep window is essentially how long to wait before trying to close the circuit breaker again.
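
As an illustration of the properties just listed, here is a minimal Python sketch. Ground Fault itself is a C# library whose source is not shown in the talk, so every class, field, and default value below is a hypothetical stand-in. The talk describes the minimum volume as per second; this sketch simply applies it to whatever measurement window the caller counts over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    """Hypothetical settings mirroring the properties named in the talk."""
    resource_name: str                    # logical resource, e.g. "contest-db"
    command_name: str                     # e.g. "get-more-entrants"
    timeout_seconds: float = 1.0          # how long to wait before erroring out
    failure_threshold_pct: float = 80.0   # trip when this share of requests fail
    minimum_volume: int = 20              # ignore the error rate below this many requests
    open_sleep_seconds: float = 5.0       # wait this long before probing again


def should_trip(settings: BreakerSettings, requests: int, failures: int) -> bool:
    """Decide whether the breaker should trip for one measurement window."""
    if requests < settings.minimum_volume:
        # Too little traffic: one failure out of one request is not a signal.
        return False
    failure_pct = 100.0 * failures / requests
    return failure_pct >= settings.failure_threshold_pct
```

The minimum-volume guard is what keeps a low-traffic service with a single failed call from being treated as a 100% failure rate.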

So, how does it reopen again?

So, that sleep window I talked about, essentially when you make a request, it checks to see if the sleep window has been exceeded and then it will try one single request down to that operation and say, “Does this work?”

And if it works, it then closes back up and takes traffic from there.
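
The reopen behavior described here can be sketched as a small state machine. This is an illustrative Python sketch under stated assumptions, not Ground Fault's actual code: the class, its method names, and the injectable `clock` parameter are all hypothetical, and the error-rate tracking that decides when to call `trip()` is left out.

```python
import time


class CircuitBreaker:
    """Toy breaker: open after a trip, one probe request after the sleep window."""

    def __init__(self, open_sleep_seconds=5.0, clock=time.monotonic):
        self.open_sleep_seconds = open_sleep_seconds
        self.clock = clock            # injectable so the example is testable
        self.opened_at = None         # None means closed (traffic flows)
        self.probe_in_flight = False

    def trip(self):
        """Open the breaker and start (or restart) the sleep window."""
        self.opened_at = self.clock()
        self.probe_in_flight = False

    def allow_request(self):
        if self.opened_at is None:
            return True
        sleep_over = self.clock() - self.opened_at >= self.open_sleep_seconds
        if sleep_over and not self.probe_in_flight:
            self.probe_in_flight = True   # let exactly one request through
            return True
        return False

    def record_success(self):
        # A successful call (including a probe) leaves the breaker closed.
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self):
        if self.probe_in_flight:
            self.trip()   # probe failed: stay open and wait another window
```

While the breaker is open, every caller except the single probe is rejected immediately, which is what keeps a struggling resource from being hit by the full request volume the moment the sleep window ends.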

So, we spent a lot of time making this as simple to call as possible.

So, it’s a little hard to read here, but essentially the only thing you have to add here is a very simple wrapper around the code that you’re ordinarily writing.

So, the simplicity here actually drove adoption very quickly and suddenly all of our traffic was using this single library.
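
The slide with the wrapper is not reproduced in this transcript, so the following Python sketch only illustrates the general idea of wrapping an existing call. The `Breaker` class and `execute` method are hypothetical names, and to keep the example short it trips after a number of consecutive failures rather than on the failure-percentage rule described earlier.

```python
class CircuitOpenError(Exception):
    """Raised instead of calling the resource while the breaker is open."""


class Breaker:
    """Deliberately tiny stand-in: trips after N consecutive failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.max_consecutive_failures

    def execute(self, operation):
        if self.is_open:
            raise CircuitOpenError("circuit open; call skipped")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result


def load_entrants():
    # Stand-in for the code you would ordinarily write, e.g. a database query.
    return ["alice", "bob"]


entrants_breaker = Breaker()
entrants = entrants_breaker.execute(load_entrants)
```

The call site changes by one line, which is the kind of low friction that makes it plausible for every API and database call to go through the same library.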

This is gonna become important.

The widespread adoption here has really helped us move forward.

So, this is when Ground Fault came to the awareness of our executive team.

So, a few months later, we rolled out Ground Fault.

You can see the Titan Ground Fault trips.

This is Jason MacInnis, a very early employee and the CTO before me.

That’s 9:08.

Two minutes later, you see latency back to normal.

Jordan Mendell, who is our Chief Product Officer, is like, “What’s a ground fault?” which cracked me up.

Errors spiked for a couple minutes, but it was protected.

And Andrew Arace, one of our team members says, “Hey, it’s our internal circuit breaker.”

So, this is the first time when our executive team is like, “Oh, this tech is cool. We just stayed up.”

So, this is a big step forward.

Case 2: The thundering herd and a saber

All right.

Case number two, “The Thundering Herd and a Saber.”

So, earlier today…I’ll get to that in a second.

So, this is a classic quote, “There’s only two hard things in computer science: cache invalidation and naming things.”

This is actually a story of both cache invalidation and naming things, kind of, oddly.

So, we could have named our services something very boring and generic.

So, we could’ve had an accounting system that was called “DraftKings Accounting System.”

It would eventually turn into DOS, which actually probably would’ve been pretty cool.

But some of these other services, it gets really confusing and you end up with an acronym.

So, we took a different pathway.

Our accounting service, for example, is called Blackbeard because, I don’t know, it’s more fun to think about a pirate and money than it is to think about DOS.

There’s not a lot of correlation there.

So, in this particular case, we had a microservice called Saber.

We thought we were clever because it returns metrics, advanced metrics, about athletes.

If you’re a sports fan, sabermetrics is what they call advanced statistics, so it’s kind of a pun.

But anyway, that’s why Saber is a thing.

And then one day we cleared the cache of Saber.

We did not know the dependency graph that Saber had generated over the last few years.

2015, it was literally, hey, we would hire three engineers and say, “Welcome to the team, build this mission critical piece of software tomorrow.”

And then you deploy it a week later.

So, as we created Saber, the system that returned all these metrics, we thought it may have one or two dependencies.

It turns out there was an explosion of dependencies.

We had several services that were depending on it.

We cleared the cache of all those services and the thundering herd took it down.

So, essentially we had hundreds of web servers, hundreds of application servers calling into a very small microservice.

When we saw things like the service maps that Datadog was showing today, I was drooling.

That’s the sort of stuff that accidentally happens in our architecture, microservices architecture, all the time.

Your dependency graph just explodes.

So, in this particular case, we had Topps, Grouper, Titan, Bach, and more all depending on Saber.

They all had local caches, we all told them to clear, and the thundering herd took down Saber.

So what do we do?

This is actually where Datadog enters the picture.

We realized that we had terrible visibility into how our system was performing.

But as I mentioned earlier, we now have a single point where every API call, every database call was going through.

So, we had a chokepoint to add things like instrumentation.

And I put this here because this is literally the entirety of the monitoring code that tracks an enormous amount of data in our system.

This drives most of our Datadog dashboards.

It drives a lot of our alerting.

It’s really become an amazing transformational thing for us, in visibility.

So, out of this, we get charts like this, which is kind of the poor man’s service map.

So, in this particular case, we have a microservice called Grouper, and here, all the other microservices that call it.

So, being able to see your dependencies and visualize it, we got all that by routing all of our traffic through that single chokepoint.
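
The monitoring code shown on the slide is not part of this transcript. As a rough Python sketch of the idea, the example below records one tagged metric per call at the chokepoint and then derives a caller-to-resource map from the tags. `MetricsSink` is an in-memory stand-in, not a Datadog API, and the metric and tag names are assumptions.

```python
from collections import Counter, defaultdict


class MetricsSink:
    """In-memory stand-in for a metrics client (hypothetical, not a Datadog API)."""

    def __init__(self):
        self.counts = Counter()

    def increment(self, name, tags):
        self.counts[(name, tuple(sorted(tags.items())))] += 1


def instrumented_call(sink, caller, resource, command, operation):
    """Run an operation and record one tagged metric for its outcome."""
    tags = {"caller": caller, "resource": resource, "command": command}
    try:
        result = operation()
    except Exception:
        sink.increment("groundfault.error", tags)
        raise
    sink.increment("groundfault.success", tags)
    return result


def callers_by_resource(sink):
    """Derive a rough dependency map from the recorded tags."""
    graph = defaultdict(set)
    for (_name, tags), _count in sink.counts.items():
        tag_map = dict(tags)
        graph[tag_map["resource"]].add(tag_map["caller"])
    return {resource: sorted(callers) for resource, callers in graph.items()}
```

Because every call already passes through one wrapper, tagging each metric with the caller and the resource is enough to group a dashboard by "who calls this service".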

Other things we got.

So, this actually happened just the other night.

If you’re on Amazon, you know those RDS replicas just sometimes tip over for no particular reason and failover to the next node.

This happened a few nights ago.

You can actually see just the bump up in errors.

So, this is all tracked through our Ground Fault monitoring.

Then about an hour later, it trips some errors again.

At the same time, you can also see the circuit breaker is tripping.

So, that darker blue is the circuit breaker opening, stopping traffic from this microservice calling into that database.

And then you see it closing again, and then open, and then close again for the two blips and errors.

And those correspond to a database node failing and then getting shut off by the circuit breaker, and then us restarting it and it coming back online.

So, this is stuff that we see all the time and our dashboards are driven by this.

So, what you’ll see around our office is a lot of rich dashboards like this one, and most of these metrics are driven by our circuit breaker technology.

So, this one happens to be the metrics around contest entry.

So, this is kind of a quiet time.

We’re handling 391 entries per minute.

That’s a quiet morning.

As I said, this can get as high as 22,000.

Gronk spiked our servers

All right, who is a New England fan?

New England Patriots.

All righty.

I recognize I’m in enemy territory here.

Who hates the New England Patriots?

All right.

So, at DraftKings, Rob Gronkowski is actually the patron saint of our company, in my opinion.

You can actually see him in the title slide there.

There’s Rob Gronkowski being a tackle.

Jeremy is gonna kill me because it breaks his heart that he’s from Buffalo and went to New England, but that’s all right.

But what Gronkowski is very known for…well, actually what is Gronkowski known for?

Audience member: [inaudible]

Audience member: Energy drinks.

Travis: Energy drinks, all right.

Spiking the football.

So, scoring touchdowns and spiking the football.

This image in my mind conjures up excitement as a sports fan and terror as the CTO of DraftKings.

Because what happens when Rob Gronkowski scores a touchdown is literally everyone in this room, if you drafted Rob Gronkowski, will pull out your phone and start refreshing incessantly.

When you scale this out to our traffic, you get this.

So, you’ll actually see our traffic just kind of quietly humming along.

This low part is actually our write traffic as people are entering contests.

Kind of some bumps along the way, something exciting happened there.

And then Rob Gronkowski scores a touchdown.

And our traffic will go up 6, 8x within seconds.

So, things like being able to score very quickly become critically important.

Things like Amazon auto-scaling groups kind of suck.

They’re way too slow.

So, we’ve had to customize our scaling efforts.

Actually, if you’re interested in that, there is a presentation online from AWS re:Invent on how we handle scaling for this, but that’s not what we’re talking about today.

It’s more of Ground Fault and how our systems have become more resilient.

So, what happens?

So, Rob Gronkowski just scored a touchdown.

All right.

We all run to the NOC.

Literally, we all run to the NOC.

This has happened several times.

It’s like, “Uh-oh, how big is this gonna get?”

And occasionally, it’s gotten too big, and what happens is some microservice somewhere says, “I’m struggling here.”

And what they do is they start sending back 503s, “Hey, I’m in trouble.”

And what that did traditionally is it would end up flipping the circuit breaker.

Well, guess what?

Then that microservice starts erroring and then it flips its circuit breaker.

So, you get this cascading trip of circuit breakers across the system.

So, an outage or a degraded service moment that might last a minute suddenly becomes five, six minutes of problems.

So, not great.

So, again, with that single chokepoint, we’re able to do some interesting things.

The easiest thing we did was we just added a special little 503, “Hey, Ground Fault, I’m good.”

Now, this is how a microservice says, “Hey, I’m shedding the load appropriately, I’m handling this,” so the Ground Fault circuit breakers don’t trip.

So, what this has resulted in is a small blip in activity: instead of an outage or a degraded service moment of minutes, it’s now just seconds.

So, in the case of Saber, it might say, “Hey, I’m good, I’m busy, but I got this, I’m shedding load effectively.”

Now, if you’re getting a 503 without that, it probably means you don’t have any healthy hosts in your load balancer and that’s a whole different animal.
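
A caller-side check for this kind of annotated 503 might look like the Python sketch below. The talk does not say how the annotation is carried, so the `X-Load-Shed` header name and its value are purely hypothetical.

```python
LOAD_SHED_HEADER = "X-Load-Shed"   # hypothetical header name


def counts_as_failure(status_code, headers):
    """Return True if a response should count against the circuit breaker."""
    if status_code == 503 and headers.get(LOAD_SHED_HEADER) == "true":
        # The service is deliberately shedding load and says it is coping,
        # so the caller still sees an error but does not count it toward
        # tripping its own breaker.
        return False
    return status_code >= 500
```

A plain 503 without the annotation still counts as a failure, which matches the case where the load balancer has no healthy hosts behind it.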

All right.

The other thing that we discovered is databases don’t provide handy-dandy little 503 messages, they just die.

So, the way that we’ve dealt with that is we’ve actually added concurrency limits into our circuit breakers.

So, how this works.

So, I’m calling the database, “Hey, get me all those entries to that contest.”

And then I call it again, “Hey, give me all those entries to the contest,” and they’ll start piling up.

Once you get too many of those piled up, we actually say you only can really have five of those operations live at a time or you should really stop making that call.

So, the circuit breaker technology now will just flip that circuit breaker and prevent too much concurrency from happening.
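
A concurrency limit of this kind can be sketched in Python as a counter of in-flight calls per command. The class and exception names are hypothetical, and the limit of five is taken from the example figure in the talk rather than from any known Ground Fault default.

```python
import threading


class ConcurrencyLimitExceeded(Exception):
    """Raised when too many calls to the same command are already running."""


class ConcurrencyLimiter:
    """Reject calls once too many of the same command are in flight."""

    def __init__(self, max_in_flight=5):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def execute(self, operation):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                raise ConcurrencyLimitExceeded("too many concurrent calls")
            self.in_flight += 1
        try:
            # The lock is released while the operation runs, so other
            # threads can still be admitted or rejected in the meantime.
            return operation()
        finally:
            with self._lock:
                self.in_flight -= 1
```

Rejecting the sixth concurrent call fails fast on the client instead of letting slow queries pile up on a database that cannot say "I'm busy" for itself.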

This has actually become pretty dominant.

We’re, like everyone else…

Actually, let me pause for a second.

How many people were super excited that Slack went down last week?

Yeah.

We are addicted to Slack.

You’ll see things like “Warn: command concurrency high.”

So, these alerts will fire off.

In this particular case, we’re recording an experiment in the database.

What happens with these experiments is they’ll all refresh at the same time and we’ll get a rash of concurrency.

So, these alerts, A: will let us know when something’s about to go bad.

And B: when it really goes bad, it shuts it all off.

Epilogue: Current state

All right.

So, where did we end up?

So, let’s pause for a moment here and look at how the circuit breakers evolved.

So, it starts with, “Hey, let’s begin the request.”

Then we talked about concurrency, like, “Hey, are we over the limit for concurrency? Are we under? Are we good for a circuit breaker? Are we good for a health check?”

And it just works its way through the system.

So, you can essentially see through this flowchart that what started as a very simple, “Hey, we’ve got a problem, open the circuit,” has now evolved into a pretty complex system that has provided us a lot of safeguards and has kept the site up more times than I care to think about.

So, I’m really excited about this.

My phone is very excited about this.

And I think this is a great example of how architecture has to evolve with microservices.

It’s a different strategy than what you might be used to with a monolithic app.

You just find a problem, fix a problem.

Find a problem, fix a problem.

It’s more of Sisyphus rolling the rock uphill than it is some Big Bang.

All righty.

If you’re curious about these problems and you’re excited about these problems, we’re hiring.

I’m sorry?

If you’re excited about these problems like I am, we are hiring, we’re based in Boston.

We’re trying to build our team to roughly double over the next 12 to 18 months.

So, if this sounds exciting to you, please go to careers.draftkings.com.

If you want a sticker to remind you, let me know.

I’ve got some.

And with that, I’ll open up to some questions.

Q&A

Audience member: You’re Microsoft-based, so why not Azure [inaudible]?

Travis: What I’ve learned of being an early employee at a startup is, essentially, it all depends on 15-minute conversations in a hallway years ago.

So, in this particular case, it was like, “What technology should we use? We know .NET developers, it should be .NET.”

“What cloud provider should we use? Amazon’s cheaper.”

And that was really about it.

We’ve looked at Azure for a backup provider.

It’s pretty compelling.

It’s come a long way.

I wouldn’t sneeze at it, but at the same time, running Windows on Amazon, it might not be as hard as you think.

We’ve had a lot of success with it.

Audience member: Are you on .NET Core?

Travis: No.

So, the question was, are we on .NET Core?

We’re doing some research and we’ve got some small things on .NET Core.

We’re leveraging .NET Core for Lambdas, for example.

We have not replaced all of our microservice ecosystem with .NET Core, but we probably will over time.

Audience member: [inaudible]

Travis: I’m sorry, I’m having a hard time hearing you.

Audience member: [inaudible] Oh, thanks.

Windows containers, like non-core Windows containers are huge, aren’t they?

How do you break those up?

How do you manage huge containers?

Travis: Yeah. So, the question is about Windows containers, we’re not using them right now.

Azure has got really good support for it, Amazon’s evolving into it, so we’re not using a lot of containerization right now.

We use pretty much bare metal…well, not bare metal, but virtual machines, EC2.

I think over time containers are gonna be super important, but it’s not where we’re at right now.

Audience member: What kinds of things are you doing to improve database scalability [inaudible]?

Travis: Yep.

The primary thing we did for scalability, so microservice is a part of it.

The other part of…so, the question, sorry, to repeat, is what are we doing for database scalability?

The big thing we’ve done for database scalability is partition databases into more logical domains.

So instead of one single point of failure database, we have domain specific databases across.

We also discovered that Amazon Aurora is awesome and we get a lot more volume out of it.

For our particular use case, we’ve got…I don’t wanna give an exact figure, but a lot more scalability.

So, according to our load tests, Amazon Aurora basically bought us a couple of years of not actually having to do anything major.

Audience member: Is that Aurora MySQL or Aurora…

Travis: Yep, Aurora MySQL.

And apologies, I really can’t see people, so…

Audience member: So, you talked about clients insulating themselves from the database.

Are you also mixing that with database-limiting max cons and services limiting their inbound connections?

Travis: Yes.

Yeah.

So, the question is, what other limitations are we putting in the system?

So, Ground Fault is a piece of it.

We use all the standard thread pool limitations, connection limitations, connection pooling.

So, Ground Fault is just kind of a piece of the pie, but we’re using a broader suite of those tools as well.

Audience member: Similar question on database performance.

One thing that we often face is read IOPS fail or reduce the budget reducers on Amazon for read IOPS and we see the length of requests.

So, the database is still up, the load is okay, but the rate at which it services the request goes down.

Does circuit breaker account for that or do you deal with that in some other way?

Travis: Yeah.

So, if things slow down, they exceed their timeout consistently, the circuit breaker will flip and shut things off.

If you haven’t tried Aurora, try it.

The latency is dramatically better, particularly on what you’re talking about.

But yes, so the timeouts and Ground Fault, if something exceeds the threshold that we’ve set for a long period of time, it will flip and shut off traffic.

Audience member: You’re a Datadog customer and have been since 2014.

Travis: Yup.

Audience member: What do you guys use for APM given that you’re a .NET shop?

Travis: Yeah, for APM, I don’t know what we call this, “Travis tracing?”

We do not use a formalized APM right now.

We actually do do tracing.

We put it in ELK right now.

It’s surprising how much use you can get out of it.

I was looking at the Datadog presentations today.

I’m super excited.

The .NET agent is apparently coming.

I’m excited to try it.

But our adventures in APMs have actually not been that positive so far.

We tried New Relic for a while, it didn’t work for us.

It works for a lot of people, but it wasn’t effective for us.

So, we have, I don’t know, poor man’s ELK tracing.

It does work and does provide that same information.

And ironically, we actually do have some plugins that make it almost look like those heatmaps we were looking at today at Datadog.

Audience member: What’s your process for determining the thresholds?

Travis: What’s our process for determining thresholds?

So, Datadog actually provides a lot of value there.

We look at where the 95th percentile is for performance, then we’ll dial that back.

One of the interesting things is we will see what the 95th is and we might double it to take account of the flux, the normal flux in traffic.

What’s interesting about our user base though is we have some really passionate users that enter a lot of contests, and they’re a very small percentage of our users, and then we have a lot of users that are very normal.

So, the 95th percentile is often the cutoff between our high-value customers and our casual customers.

So, we have to be careful there just doing raw metrics, looking at how things are performing because we’ll actually hurt our most valuable customers if we just use that blindly.
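
The "take the 95th percentile and double it" starting point can be written down as a short calculation. This Python sketch uses the nearest-rank percentile definition as an assumption (the talk does not say which definition their dashboards use), and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggested_timeout_ms(latencies_ms, pct=95, headroom=2.0):
    """Starting-point timeout: the chosen percentile times a headroom factor."""
    return percentile(latencies_ms, pct) * headroom


latencies = list(range(1, 101))             # 1 ms .. 100 ms, one sample each
timeout = suggested_timeout_ms(latencies)   # p95 is 95 ms, doubled to 190.0
```

As the answer notes, a number like this is only a first guess: if the heaviest users sit above the 95th percentile, a timeout derived blindly from it would cut off exactly those requests, so the result still needs tuning against real traffic.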

So, it’s tuning.

We do a lot of load testing over the summer, typically, where we’re fine-tuning things and reevaluating where our thresholds are.

Now, one great thing about how we’ve done Ground Fault is we actually have a centralized repository of all those parameters we were looking at earlier for timeouts, error thresholds, etc.

So, we can actually tweak these in real time.

So, if we’re seeing ongoing problems, we can actually go in the database and adjust a timeout and it takes effect fairly quickly, not immediately, but fairly quickly.

So, we can tune these based on the traffic that we’re seeing.
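
One way settings stored centrally could take effect "fairly quickly, not immediately" is for each process to cache them and re-read them on an interval. That mechanism is an assumption, since the talk does not describe how the refresh works. The Python sketch below uses hypothetical names, and `load_settings` stands in for whatever query reads the central store.

```python
import time


class RefreshingSettings:
    """Cache settings locally and re-read them from a central store periodically."""

    def __init__(self, load_settings, refresh_seconds=30.0, clock=time.monotonic):
        self._load_settings = load_settings   # hypothetical loader, e.g. a DB query
        self._refresh_seconds = refresh_seconds
        self._clock = clock                   # injectable so the example is testable
        self._loaded_at = clock()
        self._values = dict(load_settings())  # copy, so later edits are not seen early

    def get(self, key):
        if self._clock() - self._loaded_at >= self._refresh_seconds:
            self._values = dict(self._load_settings())
            self._loaded_at = self._clock()
        return self._values[key]
```

Reads between refreshes are served from local memory, so the hot path never waits on the central store.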

Audience member: Quick question.

Does tripping a circuit breaker usually end in a user seeing an error?

Travis: Good question.

So, the question was, does tripping a circuit breaker usually end up in a user seeing an error?

The answer is no.

Often what happens is we have fallbacks built in the circuit breaker.

So, if you get an error, you return a cached response.

Often what it amounts to is like, “Hey, we might have this useful feature on the site that is now disabled, but you can still do everything else.”

So, a degraded experience is more common than an error, but errors can happen.

Those Gronk spikes sometimes are pretty intense.
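
The fallback idea can be illustrated with a few lines of Python. This is a sketch with hypothetical names, not the Ground Fault API: the protected call fails (here because the breaker is open) and the caller serves the last cached response instead.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the underlying resource."""


def call_with_fallback(primary, fallback):
    """Try the protected call; on any failure, serve the fallback instead."""
    try:
        return primary()
    except Exception:
        return fallback()


cache = {"entrants:42": ["alice", "bob"]}   # last known good response


def fetch_entrants_live():
    # Simulates a call rejected by an open circuit breaker.
    raise CircuitOpenError("circuit open for get-more-entrants")


entrants = call_with_fallback(fetch_entrants_live, lambda: cache["entrants:42"])
```

The user sees slightly stale data or a disabled feature rather than an error page, which is the degraded experience described in the answer.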

Yeah?

Audience member: [inaudible]

Travis: Yup, let me go back.

So, the cascade you’re talking about, so you see a request from service A to service B to service C.

Service C has a problem and it ends up tripping up service B and A.

So, the way that we solve that is actually by treating some errors as like, ignore this.

Ignore this error for your circuit breakers.

So, the way that we signal that is just a 503 with an annotation that says, “Hey, Ground Fault, I have this, I’m good.”

So, this is what prevents the cascading of the circuit breaker.

So, if service C goes down and trips, it won’t take down B, it won’t take down A, just service B knows that service C is okay.

It’s shedding its load effectively.

Answer your question?

Audience member: A little more on the same question.

I’m curious why the logic…why you guys chose to put the logic of, “Hey, I’m handling this” into individual services rather than let the circuit breaker be…rather than allowing it to flow rather than completely on/off.

Travis: Yeah, it could probably work the other way as well.

For our particular case, the thread pool library that we use is actually pretty carefully tuned to deal with traffic.

So, we know very well…we can look at things like how long this request has been sitting for processing by a microservice.

And we can be really smart about when we shed load.

We might choose to like, “Hey, this is a slow thing. We’re gonna shed that quicker.”

It’s just gained us a lot more flexibility in our thread pool.

So, you could do it either way.

In our particular case with the custom thread pooling that we’ve got, it’s actually more valuable to do it on the microservice side than the client.

Yeah?

Audience member: Hi.

I was curious, where do you maintain state about historical latency error rate?

Is it all in process memory or is it a shared cache?

Travis: Yeah.

So, the question is how we maintain state of the circuit breakers.

So, it’s stored in memory.

We keep rolling counters, essentially, that roll over every few seconds, and then we look back over the past few windows.

So, concretely what will happen is we might have a counter that lasts for 10 seconds and then once that 10-second window has passed, we’ll throw it away and add another counter.

So, we might have three windows that we’re looking at, for example, and just kind of rotating through.

It’s actually just a little linked list implementation honestly.

But keeping it in process and fast, if we had to call to yet another remote resource for critical production, that would just… Redis goes down, we’re hosed, etc.
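
The rolling counters described in this answer can be sketched in Python as a short queue of fixed-length buckets. The talk mentions a linked list; this sketch swaps in `collections.deque` with a maximum length, and all names and the 10-second, three-bucket defaults are illustrative assumptions.

```python
import time
from collections import deque


class RollingCounter:
    """Keep the last few fixed-length buckets of request and error counts."""

    def __init__(self, bucket_seconds=10.0, bucket_count=3, clock=time.monotonic):
        self.bucket_seconds = bucket_seconds
        self.clock = clock
        # Each bucket is [bucket_index, requests, errors]; once the deque is
        # full, appending a new bucket drops the oldest one.
        self.buckets = deque(maxlen=bucket_count)

    def _current_bucket(self):
        index = int(self.clock() // self.bucket_seconds)
        if not self.buckets or self.buckets[-1][0] != index:
            self.buckets.append([index, 0, 0])
        return self.buckets[-1]

    def record(self, is_error):
        bucket = self._current_bucket()
        bucket[1] += 1
        if is_error:
            bucket[2] += 1

    def totals(self):
        """Return (requests, errors) over buckets still inside the look-back span."""
        newest = int(self.clock() // self.bucket_seconds)
        oldest_allowed = newest - self.buckets.maxlen + 1
        live = [b for b in self.buckets if b[0] >= oldest_allowed]
        return sum(b[1] for b in live), sum(b[2] for b in live)
```

Everything lives in process memory, so deciding whether to trip never depends on another remote resource being up.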

Audience member: [inaudible] patterns or tech on the individual instance of the service provider, not across the service?

Travis: Correct.

So, the question is, is the circuit breaker state kept locally with the microservice?

The answer is yes.

Again, we don’t really wanna have any dependencies on external resources when we’re making these decisions.

So, in some rare cases, you’ll see a sick node that’s having a networking problem over to a service that is kind of isolated.

So, you might have one microservice out of an array of seven of the same type that is sick for some reason and its circuit breakers might flip.

Yeah?

Audience member: Is there a strategy to test, like with production traffic, pre-production [inaudible]?

How do you know that the stuff works before [inaudible]?

Travis: Yeah.

Is there a strategy of how to test?

Yes.

So, NFL opening Sunday is our Black Friday, our Cyber Monday, our D-Day, whatever you wanna call it.

And our traffic goes up a lot on those Sundays.

So, we spend a lot of this time of year honestly looking at the year’s traffic patterns, scaling them up, tuning these metrics, really just providing the safety that we expect and want for opening day.

Our service traffic might go up 8x on that Sunday compared to normal baseline traffic.

We just get a lot more users.

So, we have to do a lot of load testing.

We can’t just guess in that particular case.

Yeah?

Audience member: Do you [inaudible]?

Travis: Yeah.

The question is, does it make sense to schedule our scale-ups in advance?

The unambiguous, absolutely clear answer is yes, we scale up in advance, always.

So, there’s actually a pretty good talk that we did at AWS re:Invent that goes into detail.

Our machine learning team has actually built algorithms to project our peak load and we scale-up hours before we’re expecting to get there.

Audience member: What was your first microservice [inaudible]?

Travis: Yeah.

So, our first microservice was actually scoring.

Scoring was pulled out specifically because of the “Gronk effect.”

Going fast there was very important.

Early microservices that we created were things that wrapped contests, things that wrapped accounting.

It definitely was an evolution.

Those early microservices are quite a bit bigger than some now.

And honestly, it’s a matter of debate among the engineers whether those bigger microservices are better.

I don’t know, not micro, but medium services are better than the nanoservices.

So, there’s a spectrum.

But essentially, we wrapped the core database problems we had.

So, we had a financial transaction, so we wrapped that in a microservice.

We had entry, we wrapped that in a microservice.

We had lineups, which is essentially your fantasy team, we wrapped that in microservice.

So, it was really like our core domain that we put microservices around first.

Audience member: Those projected scale-ups you just mentioned, do those actually takes sport related…make sure you don’t have to explain who or where games are or sports stories like this?

Travis: Yeah.

So, the question is whether we take into account who’s playing, the popularity of the sport, etc., who’s playing who, when we predict scale.

We actually use… I encourage you to watch the detailed presentation from Amazon, but we actually use the excitement of the fans entering contests to project out how much volume that we’re gonna do and how much we’ll need.

So, we have an algorithm that figures out how big of contests we need to make, how many we need to post, that takes into account all those factors, not intentionally, but as a side effect.

Audience member: Do you use log management from Datadog?

Travis: The question is, do we use log management from Datadog?

We do not right now.

It’s pretty cool.

I like it a lot.

We’re not using that yet.

We use ELK Stack for our logging.

That’s more of a matter of it wasn’t around when we built our logging infrastructure than any sort of judgment on the product because I think it’s actually pretty cool.

All right.

So, thank you all for attending.

I just want to reinforce that the approach to microservices, people seem to think it’s an end in itself, but it’s actually an evolution.

So, hopefully, today you got to see how Ground Fault helped evolve and cope with microservices and adjust over time.

Thank you all for attending.

And if you want a sticker…
