Case Study: Game Server Monitoring | Datadog

Learn how EA DICE uses Datadog to monitor game servers and stress test beta launches.

EA Digital Illusions CE AB (EA DICE) is a video game developer based in Stockholm. The company was founded in 1992 and has been a subsidiary of Electronic Arts since 2006. Its releases include the Battlefield, Mirror’s Edge, and Star Wars Battlefront series.

Battlefield V open beta: lifelines don’t apply to game servers

David Rohr was monitoring the beta launch of Battlefield V with increasing intensity. This was a stress test in more ways than one. As a veteran of the gaming industry, Rohr knew that the beta launch of a new game is always a critical event. In the case of a AAA title like Battlefield, one of DICE’s most successful releases of all time, the beta is often more stressful than the main launch. Millions of dollars of investment and multiple years of work by hundreds of people went into creating a game that’s visually stunning with incredible audio and sound effects—the production quality matches that of a big-budget Hollywood film. The pressure to launch successfully is intense.

Battlefield has a global fan base that eagerly waits for every release. Every feature, every level, every nuance of the game attracts discussion and debate. The game itself is a virtual battlefield, where 64 players compete in teams to fight for control over key locations using tactics, strategy, and situational awareness. AAA titles such as Battlefield have an aggressive launch period: traffic peaks in the first 48 hours post-launch, often with upwards of 10 million users. That is why the stages preceding the main launch typically involve a lot of load testing.

Game server monitoring challenges facing the team

As the lead engineer for the game server team, Rohr has a ringside view of the action. His team is directly responsible for the player experience. Their first challenge was preparing for the scale and peak traffic in the hours following the game’s beta launch. The alpha release is usually designed to ensure that the game functions as desired; the beta, on the other hand, is specifically intended for load testing. If DICE does their homework correctly, the load pattern during the beta should help to stress test the game, ensuring that the main launch, which follows the beta, is boring and uneventful from an engineering standpoint.

The second challenge was finding and fixing issues that affected the performance of the game and the experience for the players during the beta. This included technical aspects like server stability, latency levels, and matchmaking (having a minimum number of players in each squad, on opposite sides, to initiate a game), as well as gameplay aspects such as weapon balancing, progression, and anything else that needed to be optimized to make the game as fun as possible. With the high number of concurrent players, the publicity and media attention of a high-profile AAA release, and the sky-high expectations from avid fans of the series, it was the perfect storm for the DICE engineering team. However, DICE had just begun using Datadog’s Log Management platform, an observability solution well suited to diagnosing and dealing with these challenges.

DICE’s requirements for a central logging solution

DICE was a longstanding customer of Datadog, having started with Infrastructure Monitoring in 2013. Their primary focus was on making great games; infrastructure monitoring and custom metrics were tools that helped the DICE engineers make their games better.

The game server team was continually on the lookout for a central log management solution to complement their infrastructure monitoring with Datadog, and they had evaluated a number of logging solutions. From the outset, they were only interested in a solution that could provide insight into their logs without their team needing to run and maintain the logging system or incur any other overhead. A second requirement was finding a cost-effective logging solution for game server monitoring due to the large volume of logs. Finally, they wanted a logging solution that integrated with everything in their tech stack.

DICE’s infrastructure is 100% in the cloud and completely Linux-based. They run everything using orchestration services such as Mesos and Apache Aurora. The game server architecture uses Amazon EC2 instances with custom dynamic scaling extensions to AWS. It depends on myriad backend services and microservices, including connections to MySQL, Cassandra, Redis, Amazon RDS, and Amazon S3 storage services.

Datadog Log Management and the Battlefield V beta

When Datadog announced its Log Management solution in March 2018, the DICE engineering team quickly started a trial during the closed alpha of Battlefield V. To set up log collection, they just had to add two lines of code to the same Datadog Agent that they were already running. Rohr quickly realized the benefit of having metrics and logs in the same monitoring platform: “The unification and seamless correlation helped my team find issues that we never would have found without Datadog logs; being able to get more insight into our game’s performance or diagnose issues that needed more details to troubleshoot was of tremendous value.” Having seen the value during the alpha, it was a simple decision to continue using Datadog Log Management for the beta of Battlefield V.

A crucial aspect of Datadog’s architecture is that logs are parsed on ingestion, not at query time, so users can search through billions of logs and quickly find the logs that matter to them. Being able to search on the fly meant that DICE’s engineers could finally troubleshoot proactively using logs, rather than reacting to issues only when they surfaced via other indicators. According to Rohr, “If my team needed even more context, they could easily just add that to their logs and we would instantly have searchable data. This meant that we could solve many issues long before they caused any issues to the players.”
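The idea of adding context to logs so that it becomes searchable can be illustrated with a minimal sketch. It assumes the game server emits one JSON object per log line, which is a common way to make fields easy to parse at ingestion; the source does not describe DICE's actual log format. The logger name, the `context` key, and the fields `server_id` and `squad_size` are hypothetical names chosen for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes a top-level field.
        # "context" is a hypothetical convention, not part of any Datadog API.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("matchmaking", sys.stdout)
    # Adding one more field is a one-line change at the call site.
    log.warning(
        "squad below minimum size",
        extra={"context": {"server_id": "gs-0042", "squad_size": 2}},
    )
```

With this shape, giving a log line more context means adding a key to the dictionary at the call site, which matches the workflow Rohr describes: the new field arrives as structured data rather than as free text that has to be grepped.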

During the beta, DICE anticipated that log volumes would exceed 12 billion log events per day. The cost of indexing such a high volume of logs would have been prohibitive with traditional logging tools. And even if the tools they had tried previously could handle the influx cost-effectively, DICE’s engineers had found that those tools had a steep learning curve and lacked correlation capabilities, leading to poor user adoption. However, Datadog had an answer to DICE’s log management woes.

“Initially you have to prepare for the worst case with a AAA title beta launch. It’s easier when you have a monitoring solution that can scale with you; it helps my team to be better prepared for the worst to happen on day one.” David Rohr, Lead Engineer, DICE

Logging without Limits™ to the rescue

Datadog’s solution decouples log ingestion and log indexing, thereby enabling DICE to collect all their logs and bring them into Datadog, without any fear of incurring an expensive log management bill. DICE’s engineers did not have to filter their logs upfront or remove any of the log content since neither the size of the log files nor the daily peak ingested volume affected their indexing costs with Datadog. This incentivized DICE’s engineers to add more context or metadata to their logs that could help them when they were troubleshooting incidents.

Many high-volume logs, such as NGINX logs, debug logs, or browser logs, fluctuate in value depending on the circumstances. Logging without Limits™ provided the DICE engineers with a simple solution for such logs; using exclusion filters in the Datadog UI, they could easily enable or disable indexing for specific types of log data without needing to redeploy their game servers. During the beta, for instance, debug logs were initially excluded since their exceedingly high volume created a lot of noise but not much value. However, when they encountered a matchmaking issue, the game server team found it tough to test or replicate the issue with bots. The engineers turned to logs to find an answer to the matchmaking problem. By flipping the switch and indexing debug logs in the middle of the beta, they were able to analyze and identify the root cause of the problem, and implement a fix within a couple of hours. This ability to dynamically choose whether to index a log (or not) using exclusion filters further solidified Datadog’s value.
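The decoupling of ingestion from indexing can be pictured with a small toy model. This is a conceptual sketch only, not Datadog's implementation or API: the classes `ExclusionFilter` and `Pipeline`, and the `level` attribute used for matching, are hypothetical. It shows the behavior described above, where every log is ingested, only logs that no enabled exclusion filter matches are indexed, and a filter can be switched off at runtime without touching the code that emits the logs.

```python
from dataclasses import dataclass, field


@dataclass
class ExclusionFilter:
    """Hypothetical filter: excludes logs whose attributes match `match`."""
    name: str
    match: dict
    enabled: bool = True

    def excludes(self, log):
        return self.enabled and all(
            log.get(key) == value for key, value in self.match.items()
        )


@dataclass
class Pipeline:
    """Toy pipeline that ingests everything and indexes selectively."""
    filters: list = field(default_factory=list)
    ingested: list = field(default_factory=list)
    indexed: list = field(default_factory=list)

    def ingest(self, log):
        self.ingested.append(log)  # ingestion never depends on the filters
        if not any(f.excludes(log) for f in self.filters):
            self.indexed.append(log)  # only non-excluded logs are indexed

    def set_enabled(self, name, enabled):
        for f in self.filters:
            if f.name == name:
                f.enabled = enabled


pipeline = Pipeline(filters=[ExclusionFilter("drop-debug", {"level": "debug"})])
pipeline.ingest({"level": "debug", "message": "matchmaking tick"})
pipeline.ingest({"level": "error", "message": "squad could not be formed"})

# Mid-beta: start indexing debug logs without redeploying the emitters.
pipeline.set_enabled("drop-debug", False)
pipeline.ingest({"level": "debug", "message": "matchmaking tick"})
```

In this model the first debug log is ingested but not indexed, while the debug log sent after the filter is disabled is both ingested and indexed. The emitting side is identical in both cases, which is the property that let the team change what was searchable during the beta.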

The DICE engineering team also loved the tightly coupled integrations between metrics and logs. They had been using tags extensively to categorize their metrics; the ability to use the same tags for logs made the platform efficient for troubleshooting and enabled seamless correlation between metrics and logs. That meant less context switching and manual work for the game server team: they could search through millions of logs and quickly find the specific logs that caused a spike in a metric or just the ones that were related to a matchmaking issue. And Datadog’s easy-to-use UI, which allows for intuitive search and filtering with tags and facets instead of a complex query language, helped to increase user adoption.
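Why shared tags make correlation cheap can be sketched in a few lines. The example below is an illustration under assumptions, not how Datadog performs the lookup: the function `logs_for_spike`, the log record layout, and the tag values such as `service:matchmaking` and `region:eu` are all hypothetical. It selects the logs that carry every tag of a spiking metric and fall inside the spike's time window.

```python
def logs_for_spike(logs, spike_tags, spike_start, spike_end):
    """Return logs that have all of the metric's tags and fall in its time window."""
    wanted = set(spike_tags)
    return [
        log
        for log in logs
        if wanted <= set(log["tags"])
        and spike_start <= log["timestamp"] <= spike_end
    ]


# Hypothetical sample data; timestamps are seconds since an arbitrary origin.
sample_logs = [
    {"timestamp": 100, "tags": ["service:matchmaking", "region:eu"], "message": "queue ok"},
    {"timestamp": 205, "tags": ["service:matchmaking", "region:eu"], "message": "squad timeout"},
    {"timestamp": 210, "tags": ["service:matchmaking", "region:us"], "message": "queue ok"},
    {"timestamp": 215, "tags": ["service:inventory", "region:eu"], "message": "cache miss"},
]

related = logs_for_spike(
    sample_logs,
    spike_tags=["service:matchmaking", "region:eu"],
    spike_start=200,
    spike_end=260,
)
```

Because the metric and the logs use the same tag vocabulary, no mapping table between the two is needed; the tags on the spiking metric are themselves the log query.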

DICE’s engineers were able to resolve a number of bugs and scaling challenges during the beta of Battlefield V by using Datadog and the power of Logging without Limits™. This ultimately paved the way for a successful general release three months later in November 2018.

Changing the culture of observability

According to Rohr, “Our older tools for gathering and searching logs were very rudimentary. It usually started with being reactive to an issue, and then trying to find a specific server that had more information, which usually meant downloading full logs and doing advanced ‘grep’ searches.” This method was cumbersome and time-consuming, which meant that people avoided having contact with logs in production outside of specific issues. As a result, only a few people in the organization actually knew how to access and use the log data. “Logs were used mainly as a local debug tool by developers,” he says. “Once the servers hit production, most of the data was quite hidden from developers, which meant that they didn’t pay a lot of attention to the content in the logs and couldn’t really correlate issues.”

Once the developers had access to searchable logs within Datadog, a new world opened up where they could easily search and find anomalies. This led to a natural shift in their logging behavior: developers started to clean up log files and remove irrelevant fields. More importantly, they started adding and enriching their log files with contextual data and fields that would make a big difference at the time of troubleshooting. This led to a change in the culture of observability at DICE—now logs are often their first source to evaluate how games are performing, especially when millions of players are involved in a game. It has made their troubleshooting workflow more efficient—some of the issues that would have taken weeks of searching are now identified and resolved within a matter of hours. With the successful launch and general release of Battlefield V, Datadog Log Management has become an indispensable tool for the DICE engineering team, and is now an integral part of their monitoring and troubleshooting workflow for other titles such as Star Wars Battlefront II.

“Datadog log management helps us see things that we didn’t even know were broken. It helps us identify issues, see how severe they are, how often they occur and then fix them effectively. These are crucial learnings for us from an observability standpoint because we are always striving to make our systems and our teams better.” David Rohr, Lead Engineer, DICE
