
Surviving blockbuster game releases at EA


Published: July 12, 2018

AAA titles at Dice

So I work at a company called Dice.

And if you’re not super into games or if you’re not super into first-person shooter games specifically, you might think of Dice as a career website, but it is in fact also a game studio from Stockholm, Sweden.

We’ve been around for about 25 years.

For the past 15 of those, we’ve built a series called Battlefield.

And this is the latest installment of that series.

And if you’ve never heard of that, that’s fine, I can briefly describe what that game is all about.

And essentially it is a virtual battlefield where 64 players play together, and you can play in a game mode called Conquest: you have five flags, and your mission as part of a team is to dominate those flags.

So this is not a war simulator, the goal is not to kill as many people as possible, but the goal is to use a bit of tactics and a bit of strategy and the occasional James Bond moment, and have a lot of fun. And win the round with your friends.

So Battlefield nowadays belongs to a category of games called AAA titles. And AAA differs a bit from…

If you think about how easy it is to make a game right now, the barrier to entry for creating games is really, really low.

So you can take a team of five and you can spend six months, and you can release a game on Steam.

And it might gain traction and it might slowly become more popular in the same way that a startup grows, right.

But AAA games are something different.

If you want to understand AAA titles, then don’t think about a large software engineering project, but think about something like the production of a Michael Bay movie.

So hundreds of millions of dollars in budget, hundreds of people working over multiple years to create something that’s incredibly visually stunning.

I mean, look at this trailer, all of this is in-game footage.

I don’t have any audio here but you should listen to what our audio team creates, it’s simply amazing.

So these are incredibly large productions.

And what’s also large is the marketing budgets for these titles.

Challenge one

So this is the release date for our next game.

And our marketing team does many things of course, I’m not gonna pretend to understand all of it, but two things that they do affect me as an engineer. The first is that they’re gonna take this date and they’re gonna drum up an immense amount of excitement that will peak on this date.

And they’re also going to make sure that everybody knows about this date.

So anybody who likes the Battlefield series is going to know what happens on this date.

And the effect of that, for me as an engineer, is this: this is a typical launch week for a Dice title, right?

And I don’t know if it’s super clear on this picture, but we go from essentially zero to the highest peak that we will ever observe in traffic in 48 hours.

And this is not, you know, tens of thousands of users but we might sell 10 or 20 or 30 million copies in the first weeks.

So this is an immense amount of traffic.

So that’s one challenge.

How do we deal with that?

Challenge two

And the second challenge is this, so this is a few months after launch.

And what’s happening here is that there is a slow decline.

Because when we release our games, people will do nothing but play it for a few weeks.

But then they’re gonna start, you know, perhaps spending some time with their family, or going to the movies, or perhaps, you know, play a competitor’s game.

So what happens is that there’s a slow decline.

And from a load perspective, that’s awesome.

Because post-launch we’ve survived our peak.

We now have the slow decline.

But the problem for us is that to survive the peak we’re gonna build out capacity.

So we’re gonna build out capacity so that we know that we have more than we need.

There are different strategies here of course, I mean, what you could do is not build out capacity and show an error message, or place people in a queue.

“It’s your time to play in forty-five minutes.”

We try not to do that.

So we try to build out more capacity than we need.

If there are any CFOs in the audience you’re gonna realize that there is a big void here.

So there’s a void where we have capacity that we pay for but are not using.

So that’s another problem that we have, a very real problem for us. Okay.

So what I’m gonna talk about today is three things: backend services, what kind of services the games need and how we thought about making them resilient enough to withstand the launch.

Load tests, very real for us.

And the reason for that is that we can’t test in production, for obvious reasons.

And game servers, which is probably the more interesting of our topics, since it’s a very, very specific problem for games.

Backend services

All right.

So, when it comes to backend services, what does a game need?

Identity and commerce

So first of all, identity and commerce.

And I’m sorry if there are identity and commerce people in the audience, but this is super boring and not very specific to games either.

But you need to be able to authenticate, you need to be able to reset your password, we need to keep track of what subscriptions you have, what you own, stuff like that.

Not very game specific.

Matchmaking

Matchmaking, on the other hand, super specific game problem.

If you’re unfamiliar with the term, the problem of matchmaking is that when a player launches our game, we’re gonna have a big nice button in the UI that says, “Just put me on any server.”

So a player can click that.

And then we have an additional problem, because now we have that player and tens of thousands of others in the same situation, and we need to find them a round, or if a current round doesn’t exist we need to create a new round, and we need to make sure that it is balanced in a way so that everybody is having fun.

You want it to be balanced so that you’re uncertain up until the last minute who’s gonna win.

That’s a great game.

So that’s a very, very interesting problem.
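To make the problem concrete, here is a minimal sketch in Python of the core loop: place the player in an existing round with open slots and a similar average skill, or spin up a new round if none fits. The skill tolerance and data shapes are made up for illustration; a real matchmaker weighs far more signals (region, latency, party size, and so on).

```python
from dataclasses import dataclass, field

SKILL_TOLERANCE = 150   # hypothetical: how far apart skill ratings may be
ROUND_SIZE = 64         # players per Conquest round


@dataclass
class Player:
    name: str
    skill: int


@dataclass
class Round:
    players: list = field(default_factory=list)

    def average_skill(self):
        return sum(p.skill for p in self.players) / len(self.players)


def place_player(player, open_rounds):
    """Put the player in a compatible round, or create a new one."""
    for rnd in open_rounds:
        if (len(rnd.players) < ROUND_SIZE
                and abs(rnd.average_skill() - player.skill) < SKILL_TOLERANCE):
            rnd.players.append(player)
            return rnd
    new_round = Round(players=[player])
    open_rounds.append(new_round)
    return new_round
```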

Stats

Social backend services.

I don’t know if stats is a super game-specific thing that we build, but stats in this case is essentially a set of counters.

So for a Battlefield title we might have 10,000 counters per player, and it might be something like: if you’re in a tank, we count the number of seconds you’ve spent in a tank on a certain map for a certain game mode, and we count the number of shots you fired, and we count the number of shots that hit.

And from shots fired and shots hit, we can derive your accuracy.

We can take that accuracy and we can create a leaderboard from that.

And we’re not gonna create one global leaderboard for that, but we’re gonna create one for the world, one for your country, and one for every level all the way down to your ZIP code.

So we’re gonna create hundreds of thousands of leaderboards.

And the advantage of creating many leaderboards is that it increases the chances that we’re gonna find one leaderboard in which you are at the top, because everybody is hopefully good at something.
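As a rough illustration (the counter names and scopes here are invented, not the real Battlefield schema), deriving accuracy from two counters and ranking it per scope might look like this:

```python
from collections import defaultdict

# hypothetical per-player counters
counters = {
    "alice": {"shots_fired": 1200, "shots_hit": 540},
    "bob":   {"shots_fired": 900,  "shots_hit": 280},
}

# hypothetical scopes per player: world, country, ZIP code
scopes = {
    "alice": ["world", "SE", "SE-11122"],
    "bob":   ["world", "US", "US-94107"],
}


def accuracy(stats):
    # derived metric: hits divided by shots fired
    return stats["shots_hit"] / max(stats["shots_fired"], 1)


# build one leaderboard per scope, sorted best accuracy first
leaderboards = defaultdict(list)
for player, stats in counters.items():
    for scope in scopes[player]:
        leaderboards[scope].append((player, accuracy(stats)))

for scope, board in leaderboards.items():
    board.sort(key=lambda entry: entry[1], reverse=True)
    print(scope, board)
```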

Social

Other social services.

We keep track of where your friends are playing.

So when you launch the game, we might pop up a tile in the UI that says, “Hey, your friend is playing on such-and-such a server.

Do you wanna join her?”

And you click one button and we join you with your friend, for example.

Analytics and telemetry.

Obviously, these are not player-facing services but more internal-facing services, for example, heat maps for our level designers.

So when our level designers build maps that our players play on, these maps are not symmetrical because it would look weird, it would look unnatural if they were totally symmetrical.

But we need the advantages or disadvantages for both teams to be symmetrical.

You don’t want one team constantly winning simply because they were randomized to one part of the map.

So for that, we can take data from our production environments and generate heat maps, and show, I mean, where do people congregate, where are the sniper perches, have people found ways to get where they shouldn’t be able to be, for example.
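The heat maps themselves are conceptually simple: bucket telemetry positions into a grid over the map and count. A toy version, where the coordinates and bucket size are made up:

```python
from collections import Counter

BUCKET = 10.0  # hypothetical grid cell size in metres


def heat_map(positions):
    """Count how many telemetry samples fall in each grid cell of the map."""
    cells = Counter()
    for x, y in positions:
        cells[(int(x // BUCKET), int(y // BUCKET))] += 1
    return cells


# e.g. player deaths reported by game servers; a real pipeline reads these from telemetry
print(heat_map([(3.0, 4.0), (5.0, 9.0), (120.0, 44.0)]))
```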

So these are some examples of the services that we build.

Dice’s stack

Okay, so how do we build these things?

This is not gonna be surprising, because this pretty much looks the same as everybody else’s stack, right?

We used to have a monolith, then we started splitting out services, and now we don’t have a monolith anymore.

So we’ve been doing that for the past five years or so.

And we’re on Scala, we use Finagle.

If you don’t know what Finagle is, it’s Twitter’s RPC system, the one they use for their microservices, or at least used to.

And if you don’t know what an RPC system is, that’s fine because I can briefly explain it.

So an RPC system is a framework that allows you to specify your public API for your service.

In our case, we use Thrift.

And then for the server side you can generate a server and fill in the blanks, essentially.

And for the client you generate the client.

And then the RPC system will make sure to handle everything that’s hard, essentially, in communicating client to server.

So things like service discovery, client-side load balancing, circuit breakers, retries and retry budgets, stuff like that.
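Our stack is Scala and Finagle, but the “fill in the blanks” workflow is easy to show with Apache Thrift’s plain Python bindings. A minimal sketch, assuming a hypothetical stats.thrift IDL that the Thrift compiler has turned into a StatsService module; Finagle layers the service discovery, load balancing, and retries on top of the same basic idea.

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer

from stats import StatsService  # hypothetical thrift-generated module


class StatsHandler:
    """Fill in the blanks: implement the methods declared in the IDL."""

    def incrementCounter(self, player_id, counter, amount):
        # business logic only; the RPC framework owns the wire format
        print(f"{player_id}: {counter} += {amount}")


if __name__ == "__main__":
    processor = StatsService.Processor(StatsHandler())
    transport = TSocket.TServerSocket(host="0.0.0.0", port=9090)
    server = TServer.TSimpleServer(
        processor,
        transport,
        TTransport.TBufferedTransportFactory(),
        TBinaryProtocol.TBinaryProtocolFactory(),
    )
    server.serve()
```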

And it has worked amazingly well for us.

We run this on Mesos and Aurora.

We started this five years ago or so.

So Kubernetes wasn’t mature enough back then; today we would definitely be on Kubernetes.

But Mesos and Aurora, essentially that’s orchestration, right?

So the two of them together fulfill the same task as Kubernetes does.

Apache Aurora is also a Twitter project, so Finagle and Aurora fit like hand in glove, essentially.

We use the cloud.

We feel that there are better ways to use our time than to operate Kafka or Cassandra, for example.

So no big surprises here.

Operational experience as code

But if you want one takeaway from this talk, then it should be this, that we have never made any decision as good as using Finagle, Mesos, and Aurora.

And the reason for that is that, if you look at the feature list for Finagle, you’re gonna see…

You can read between the lines, you can see that most of the features have been added because somebody was woken up in the middle of the night.

So there’s a bunch of features in there that are the result of an SRE having a pager go off.

And that means that when we use these products, we don’t have to wake up, right?

Retries and retry budgets are my favorite example of this.

So retries is pretty natural, right?

If you have an idempotent RPC, and it fails then you can just retry it.

So you build that into your client and that’s great.

Now you’re gonna improve your availability, and if you do speculative retries you’re gonna improve your latency distribution as well, right?

But what happens if you have a shaky server and a bunch of clients that start to retry whenever they fail?

You’re now gonna go from a shaky service to an offline service, because your clients are essentially gonna DDoS that service that was a bit shaky.

So that’s why you add retry budgets.

Retry budgets essentially say that per time unit, you are allowed to retry this many requests but above that, I mean, don’t retry, right?
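Conceptually, a retry budget is just a token bucket: ordinary requests deposit a fraction of a retry token, retries withdraw a whole one, and when the bucket is empty you fail fast instead of piling on. The sketch below is the idea only, not Finagle’s actual API (Finagle exposes it as a RetryBudget you attach to a client, and a real one also expires old tokens).

```python
import time


class RetryBudget:
    """Toy retry budget: retries are capped at a percentage of recent traffic,
    plus a small constant allowance, so a shaky service doesn't get DDoS'd."""

    def __init__(self, percent_can_retry=0.2, min_retries_per_sec=10):
        self.percent_can_retry = percent_can_retry
        self.min_retries_per_sec = min_retries_per_sec
        self.tokens = 0.0
        self.last_refill = time.monotonic()

    def record_request(self):
        # every ordinary request earns a fractional retry token
        self.tokens += self.percent_can_retry

    def try_withdraw(self):
        # top up the constant allowance, then allow a retry only if a whole token is available
        now = time.monotonic()
        self.tokens += (now - self.last_refill) * self.min_retries_per_sec
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(rpc, budget, max_attempts=3):
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return rpc()
        except IOError:
            if attempt + 1 == max_attempts or not budget.try_withdraw():
                raise  # out of attempts or out of budget: fail fast
```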

And somebody was probably woken up in the middle of the night, or at least had the operational experience to know that if you add retries, you need to add retry budgets.

And we had no idea but we got that for free.

And that’s incredibly amazing, I think.

And that’s the reason I recommend that everybody run their stuff on Kubernetes.

Not because it’s cool, or trendy, or because the dev experience is great or it’s declarative, but simply because of the amount of operational expertise that is available as code for free in this product.

You can’t buy enough, you know, super talented SRE staff and get the same result.

Observability

So I wanna briefly describe our journey on observability as well, since we are at Dashcon, right?

So in the beginning there was nothing: when we had our monolith, we essentially had a set of counters that we could watch per minute and a set of rudimentary graphs.

And the only people who could interpret this data were the two people, out of 15, who also knew how to deploy our monolith.

So not a great situation to be in.

So when we started splitting out services and running them separately, we started using Datadog.

And Datadog has been fantastic.

One thing that I really liked about moving from essentially nothing to Datadog was the fact that we now had individual engineers who started to show interest in our observability data.

So they would look at the dashboards, they would figure out what metrics were missing, they would go into the code, they would add those metrics, and then they would make sure that it was visible in the dashboard.

Nowadays, we use a bunch of different stuff. We still use Datadog but we also use Prometheus, Grafana, Zipkin, and whatever we need to get our job done.

And this is for backend systems.

Load testing

So we do load testing.

I know load testing has fallen out of fashion.

You’re supposed to test everything using small increments in your production environment.

The problem for us obviously is that we do big bang releases.

And so we need to do load testing.

So Dice is a part of Electronic Arts.

And a bunch of the systems that we use are Electronic Arts systems.

So authentication and information on what you own, for example, those are parts of the Electronic Arts system.

So you can use the same account to log into FIFA that you can use to log into Battlefield.

So load tests at Electronic Arts are an EA-wide effort.

We need to involve large parts of our organization to run these tests, since they are going to affect systems from all over the organization.

We use a bunch of different tools to run a load test.

This is one of the tools that we have used the longest.

This is a tool called Locust.

It was actually created by us, and it’s open source.

And I think it’s seven years old now, so it’s not as revolutionary as it once was, but the thing about Locust and the reason we created it was that Locust is user-centric.

What you do is that you create user scenarios in Python to describe what a user would typically do on your site.

And then it’s distributed, so we can take this and we can run it on 100 machines.

And this was actually the first time we used EC2: to generate load against our monolith, which was running on-prem.

So, a fantastic piece of software.

I know that many pieces of software can do essentially the same nowadays, but if you haven’t looked at it, I would recommend that you give it a try.
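For reference, a Locust scenario is just a Python class; the endpoints below are invented, but the user-centric shape is the point: you describe what one player does, and Locust simulates thousands of them.

```python
from locust import HttpUser, task, between


class BattlefieldPlayer(HttpUser):
    # simulated players pause between actions, like real ones do
    wait_time = between(1, 5)

    def on_start(self):
        # hypothetical endpoints, purely illustrative
        self.client.post("/login", json={"user": "player1", "password": "secret"})

    @task(3)
    def check_friends(self):
        self.client.get("/friends/online")

    @task(1)
    def matchmake(self):
        self.client.post("/matchmaking/quickmatch", json={"mode": "conquest"})
```

With a recent Locust release you would run this headless with something like `locust -f locustfile.py --headless -u 10000 -r 100 --host https://test.example`, and use the `--master`/`--worker` flags to spread the load generation across many machines.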

Prelaunch events

So load tests are one part.

The other part that we do to validate our launches is prelaunch events.

That’s a strange word, but what it means is essentially this: we let our players play our game before it is released so we can take a look at how features work, see how our backends are holding up, and validate whether the players behave the way that we expected them to behave.

So we have two major events, a closed alpha and an open beta.

And closed and open, in this case, refers to whether or not anybody can join or if you need to be invited.

And alpha, essentially refers to the state of the game at this time.

And both of these are pretty thin vertical slices of the game.

So we don’t want to essentially leak all of our functionality, all of our maps, all of our game modes, all of our new ideas six months before launch.

But we do want to get a sense for if our backends can hold up.

So the closed alpha is a bit smaller: say we invite a few hundred thousand players.

And open beta is much, much bigger.

Usually not all the way up to launch size but close enough for us to validate what we need to validate.

Cool.

Game servers

So now probably for my favorite part of this talk, game servers.

So if you don’t know what a game server means, that’s totally fine.

When I mentioned earlier that we have 64 players playing together, what each game client does when you play this game is that it connects to a game server.

So we have 1 game server and 64 clients.

And then we run 65 simulations of our game.

But the game server is the authority, so the game server can correct the clients.

So the way it works is that if I’m a client playing the game, and I use my controller to say that I wanna walk forward 10 feet, then I’m gonna do that in a very smooth motion on my client.

And 100 times per second we’re gonna send updates to the server and say that, “Hey, I’m still walking, I’m still walking.

And now I’ve stopped."

And the game server is most likely gonna say that, “Thumbs up.

That’s great.

You walked 10 feet.

You’re now at this position, the same as you claim that you were."

And it is going to tell all the other clients that you walked those 10 feet.

But it might be that the server decides that you can’t walk 10 feet.

So the server can correct you.

And this happens now and then, and mostly players don’t notice that.

But what can happen is, if you are corrected a lot, it means that you experience something called rubber banding, and it feels like you’re attached to a rubber band, which yanks you around in the world, essentially.
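A toy version of that server-side check, with made-up movement numbers: accept the client’s claimed position if it is reachable within one tick, otherwise clamp it back, which the player perceives as rubber banding if it happens a lot.

```python
MAX_SPEED = 6.0   # hypothetical max metres per second a player may move
TICK = 1.0 / 60   # one 60 Hz simulation step


def validate_move(server_pos, claimed_pos):
    """Server-authoritative movement: trust the client only within physical limits."""
    dx = claimed_pos[0] - server_pos[0]
    dy = claimed_pos[1] - server_pos[1]
    distance = (dx * dx + dy * dy) ** 0.5
    max_step = MAX_SPEED * TICK
    if distance <= max_step:
        return claimed_pos  # thumbs up: broadcast this position to the other clients
    # too far for one tick: correct the client back onto a legal position
    scale = max_step / distance
    return (server_pos[0] + dx * scale, server_pos[1] + dy * scale)
```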

Operating game servers

So operating game servers is really an issue of high and low.

And the low part of it is that this simulation runs at 60 hertz.

So we do this simulation 60 times per second.

And that means that we have roughly 16 milliseconds per iteration to complete it.

So I wouldn’t call these systems real-time systems, but they are at least some sort of soft real-time system, because we can’t block them for very long without experiencing issues.

So we need pretty strict control over hardware, OS settings, kernel configurations.

So that’s the low part of the challenge.

The high part is that we run a lot of these game servers.

So, if you imagine that we have one million concurrent users playing our game, and we have 64 players per server, then that’s more than 15,000 game servers that we run during launch.

Process-level metrics

So when we run these servers, which are essentially single-threaded processes, we don’t run one per host.

We run multiple per host, and we load test to figure out how many we can run, but typically for our large game modes we might run 10 servers on a 12-core box, for example.

But if you look at these individual processes, these single-threaded game servers, then typical stuff that we monitor is their frame rates.

Because if we drop frames or we drop packets, what’s going to happen is that either players are gonna experience the rubber banding or they might experience a situation where they see a tank that’s low on health and they pop up with their RPG and they fire a shot, they hit the tank, but nothing happens.

And that could be because the packets that contain the information that you fired upon this tank, never reached the server because the server might have been overloaded and had to throw some of these packets away.

And this is incredibly frustrating, obviously, if you’re a player.

Host-level metrics

So we also monitor host-level metrics.

And this is because obviously, we run these servers on Linux and if you do that and you are a soft real-time system, then other processes on your system could affect the way you run.

So context switching is something that we monitor.

Map loads: a map load is essentially when we start a new round and the game server loads a map from disk.

And this map is pretty big.

So this is a pretty I/O-heavy operation.

And if you’re not careful about how that’s orchestrated, it might be that other game servers running on the same core could start missing frames.

So people playing on those servers might experience rubber banding.
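On Linux, per-process context switches are cheap to read straight from /proc, which is roughly the kind of host-level signal being described here (a sketch, not our actual agent):

```python
def context_switches(pid):
    """Read voluntary and involuntary context switches for a process from /proc."""
    switches = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith(("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                switches[key] = int(value.strip())
    return switches


# e.g. sample each game server process periodically and alert on sudden jumps
print(context_switches(1))
```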

Cloud

So, the cloud.

It’s been great to us.

I’m not gonna lie to you.

For running game servers, and especially for the problems we have where we need an immense amount of capacity over a few weeks, the cloud has been great.

So for cost the cloud is fantastic for running game servers.

It’s still not quite the case that we can just go to Amazon, or Google, or Azure and tell them to give us 30,000 servers, but it’s close enough that we can reduce the cost problem significantly.

So that’s one aspect.

But the best aspect of the cloud for us has actually been bringing the people who build game servers and the people who operate game servers much, much closer together.

Because one thing we did pretty early on was to create software to operate our game servers.

So we have some special software that we’ve developed at Dice that starts up and shuts down game servers all across the world, depending on what capacity we need at any given moment.

And the advantage of having that in code is that you can then take the people who build game servers, point them to this code, and say, “This is exactly how we operate game servers.”

And that makes it much easier for them to be a part of that process when it comes to running.

So essentially DevOps in operations.
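The core of that kind of tooling is a small, very explicit capacity rule that anyone who builds game servers can read. A hypothetical example, where the headroom factor and player counts are invented:

```python
import math

PLAYERS_PER_SERVER = 64
HEADROOM = 1.25  # hypothetical: keep 25% spare so joins never fail mid scale-up


def desired_server_count(concurrent_players):
    """How many game servers a region should be running right now."""
    return math.ceil(concurrent_players / PLAYERS_PER_SERVER * HEADROOM)


def reconcile(region, concurrent_players, running_servers):
    """Start or stop servers until the region matches the desired count."""
    desired = desired_server_count(concurrent_players)
    if desired > running_servers:
        print(f"{region}: starting {desired - running_servers} game servers")
    elif desired < running_servers:
        print(f"{region}: draining {running_servers - desired} game servers")


reconcile("eu-west", concurrent_players=120_000, running_servers=2_000)
```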

Our road to observability

Yeah.

So, a few slides left.

So, when it comes to observability and understanding how these game servers run: we started running game servers a very long time ago, like 15 years ago.

The way it worked back then is that you would literally throw this over the wall, and not even within the same organization: there would be external companies that would run these game servers.

And there would be very little feedback coming back on how they behaved.

So things like rubber banding and other kinds of issues that our players experienced took a very long time to diagnose and fix.

So we started creating our own tools for observability.

So this is the first tool that we created.

And what you’re looking at here is…

Each one of these squares is a game server.

And we’re looking at a particular metric, I can’t remember which one it is, but I think it’s frames per second.

So most of them look excellent.

And some of them look okay, and okay is an acceptable value here.

But this is the equivalent of opening the door to your data center and seeing if there are any red lights anywhere.

So this is a good second-to-second overview, which is incredibly useful in those 48 hours when we have very little time to react to anything but second-to-second stuff, but there’s no history here.

There’s no way for us to compare different metrics, for example.

So we started sending aggregates to Datadog, which we were already using.

And suddenly our game server people, who were used to, you know, perhaps solving problems over weeks, could now get minute-to-minute information on exactly how everything was behaving.

So that was step one. Step two was to actually have our game servers send the metrics directly to Datadog, which is what we’re doing right now.
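Our game servers are native code, not Python, but “send the metrics directly to Datadog” looks roughly like this with the DogStatsD client; the metric names and tags here are made up.

```python
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # local Datadog agent

# hypothetical per-frame reporting from a game server process
statsd.histogram("gameserver.frame_time_ms", 14.2, tags=["map:conquest_large", "region:eu"])
statsd.increment("gameserver.packets_dropped", tags=["region:eu"])
statsd.gauge("gameserver.players_connected", 64, tags=["region:eu"])
```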

So the game server people are happy and the players are happy, which is what it’s mostly about, I think.

All right.

Surviving a game launch

So the topic of this talk was, “How to survive a game launch.”

So, how do we manage it?

Well, I would argue that we do it by preparing, because we don’t have time to react very much in 48 hours.

It’s better now, but ten years ago if you wanted to order physical hardware you should count yourself lucky if you could do that in 48 days, much less 48 hours.

So what we do is we do our prelaunch events.

We run our load tests.

We try to design our systems so that they are resilient.

We always have more capacity for our backends than we need.

And we try to make sure that we have the observability that we need, because we know that during those 48 hours there’s probably not gonna be time to add more metrics.

We would rather have 100,000 more data series than we need than too few.

So that’s the way we do it, I guess.

Are there any questions?

Questions

Audience member 1: I was wondering, so do you guys have those metrics or, like, when you’re saying observability, do you even have that when it comes to like alpha releases, or like beta testers, or things like that?

Or is that something you’re adding as the game becomes more mature?

Because the game is not fully released at that point, so.

Johan: Yeah.

No, that’s something that we add from the beginning.

So a part of these closed alphas and open betas, for example, is to test not only, you know, the game and services and systems, but also operational stuff.

So what we do, actually, during closed alphas and open betas is to essentially break stuff.

So chaos engineering.

So we introduce latency, we take systems offline, we do stuff like that to test our operational capacity as well.

But, yeah, we try to make sure to have metrics as early as possible.

But obviously we will add some during development as well.

Audience member 2: Hi.

In 2014, I think 2015, whenever Battlefield 4 was released, there was a big controversy over, like, the latency on the servers, and I know that later on you guys had released a bunch of servers that were higher hertz, I think it was like 60 or 120.

Can you walk through like what the infrastructure looked like to accommodate the higher refresh rate on those servers?

Johan: Yeah.

So the answer would be no.

And the reason is that I’m mainly a backend guy.

So the extent of my knowledge is what I’ve shared with you so far.

Cool.

No more questions?

Then, thank you all for coming.