Prevent Future Technical Issues by Centralizing Alerts, Events, and Metrics | Datadog
CASE STUDY

Prevent Future Technical Issues by Centralizing Alerts, Events, and Metrics

Learn how CircleCI decreased the number of monitoring tools they used by leveraging Datadog’s unified solution

About CircleCI

Based in San Francisco, CircleCI offers a hosted platform for continuous integration and deployment that helps development teams ship quality code. CircleCI’s platform runs over 35 million builds every month across Linux, macOS, and Windows systems and provides fast feedback loops that increase developers’ ability to get new code to customers faster.


Key Results

Time to Value

CircleCI used Datadog’s API and out-of-the-box (OOTB) integrations to set up monitoring for their tools and services in a matter of hours.

Unified Solution

Datadog provides CircleCI with a single monitoring solution, eliminating manual correlation of metrics, traces, and logs when troubleshooting incidents.

Reduced MTTD

Using Datadog, CircleCI easily visualized metric spikes and patterns to identify and fix unknown issues before customers were impacted.


Challenge

CircleCI’s team was using several patched-together monitoring tools. As CircleCI’s application infrastructure scaled, tracking the health and performance of their servers, databases, and other IT components became tedious: the team spent hours every week manually correlating the outputs of their existing monitoring solutions.


Why Datadog?

Datadog’s dashboards helped CircleCI automatically filter and visualize metrics, traces, and logs from across their tech stack in one place, saving time spent on manual correlation. Datadog’s unified monitoring solution scaled with CircleCI’s growing infrastructure and allowed them to aggregate large amounts of data in one place. This helped them assess the behavior of their applications as well as identify issues they didn’t know existed so they could improve user experience.


Need: Alerting for Current Issues and Historical Problem Analysis

Due to the rapid scaling required for their application’s infrastructure, CircleCI was becoming increasingly frustrated with how poorly their existing monitoring solutions could judge the health and performance of their servers, databases, and other IT components. When David Lowe began as a backend developer at CircleCI, the team was using several monitoring tools that had been patched together. This required the CircleCI team to spend hours every week deciphering and cross-referencing the outputs of each tool to answer questions like: “How are our queries loading?” “When the queries are slow, are they all slow?” or “Is this specific event random?” These extra steps led Lowe to become concerned about how the team was using its time. Additionally, Lowe was unhappy that CircleCI’s monitoring solution stored metrics for only two weeks. “We like to have the data long enough so if something weird happens we can see when it started, and that number always seems to be longer than two weeks ago,” stated Lowe. The only way to retain historical metrics for longer was to pull the data out and store it on a separate platform. This was unappealing since CircleCI was “trying to avoid building a monitoring solution ourselves.”

The final straw occurred when CircleCI missed an outage that should have been caught early by its monitoring system. Lowe knew then that he had “hit the limit with [their] tools” and needed to implement a more effective and sensitive monitoring solution that would scale automatically with CircleCI’s growth.

“We used to have to go back and dig through logs, but using Datadog, we are able to track live processes to prevent problems.”

David Lowe
Backend Developer, CircleCI

Alerting, Events, and Metrics All in One Place

In the first few days of trying Datadog, Lowe confirmed that this solution met all of the requirements that CircleCI needed without having to build the product themselves. “Datadog has alerting, events, and metrics all in one place,” said Lowe. This was a huge plus, since Lowe felt that other solutions were trying to treat monitoring as a multifaceted problem. “Datadog treated it as one problem,” said Lowe, giving him and his team the ability to visualize all the data in a single pane.

Decreasing the Number of Monitoring Solutions Needed

To handle rapidly scaling traffic, CircleCI needed to tell at a glance whether their system was performing well. According to Lowe, “Datadog gave us the ability to quickly visualize a fairly large EC2 cluster’s behavior. Visualizing the data is important because it summarizes large amounts of data in small images.” He added that with CircleCI’s previous system, “it was impossible for us to make alerts, hard for us to make new visualizations across old data, and impossible for us to look back at historical data. So when we got Datadog, we were suddenly publishing graphs that gave us new ways of looking at our data. It was eye opening.”

For example, before CircleCI made the switch over to Datadog, they had a known issue in which some of their API calls were slow. “But we didn’t really have a sense of which ones were slow since we had gigabytes and gigabytes of information to process. All we had before were logs, and logs aren’t good for finding patterns,” said Lowe. Since Datadog placed their alerts, metrics, and events all in one place, Lowe could now place a number of time series next to each other and see where the spikes and patterns occurred. Not only did this allow them to fix the API issue, but it also revealed previously hidden problems, which they were then able to fix before customers were impacted.
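The idea of lining up several time series to see which ones spike can be illustrated with a minimal Python sketch. This is not Datadog's implementation or API. The endpoint names, the latency samples, the function names, and the "three times the median" rule are all assumptions chosen only to show the comparison.

```python
from statistics import median


def find_spikes(series, ratio=3.0):
    """Return indices whose value exceeds `ratio` times the series median."""
    baseline = median(series)
    return [i for i, value in enumerate(series) if value > ratio * baseline]


def spikes_by_endpoint(latencies_ms, ratio=3.0):
    """Map each endpoint name to the sample indices that look like spikes."""
    return {
        name: find_spikes(values, ratio)
        for name, values in latencies_ms.items()
    }


# Hypothetical latency samples in milliseconds, one series per API endpoint.
latencies_ms = {
    "/api/builds": [120, 118, 125, 122, 119, 121],
    "/api/projects": [80, 82, 79, 410, 395, 81],
    "/api/users": [60, 61, 59, 62, 60, 58],
}

result = spikes_by_endpoint(latencies_ms)
for endpoint, spike_indices in result.items():
    print(endpoint, spike_indices)
```

With these made-up samples, only `/api/projects` is flagged, at sample indices 3 and 4. The median is used as the baseline because a handful of slow calls barely moves it, whereas they would inflate a mean.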

“We no longer use Nagios for alerting. We use Datadog’s alarms, and then we push the data into PagerDuty.”

David Lowe
Backend Developer, CircleCI

Using Customization to Evaluate the Performance of New Code

Going forward, Lowe and his team will evaluate the performance of new code and features by determining what metrics will be measured, and then selecting the right alerts. They will accomplish this by customizing the metrics on the screens in their office that show Datadog, which exist purely to provide constant feedback on CircleCI’s infrastructure. With metrics, alerts, and events all in one Datadog dashboard, Lowe’s team has been able to quickly gain the information they need to enhance CircleCI’s testing platform.
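Deciding which metric to measure and which alert to attach to it can be pictured as a small rule checked against recent samples. The Python sketch below is illustrative only and does not use Datadog's monitor API. The `AlertRule` class, the metric name `build.queue_time_seconds`, the threshold, and the window size are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlertRule:
    metric: str       # hypothetical metric name
    threshold: float  # alert when the windowed average exceeds this value
    window: int       # number of most recent samples to average


def should_alert(rule, samples):
    """Return True when the average of the last `window` samples exceeds the threshold."""
    recent = samples[-rule.window:]
    if len(recent) < rule.window:
        return False  # not enough data yet to evaluate the rule
    return mean(recent) > rule.threshold


rule = AlertRule(metric="build.queue_time_seconds", threshold=30.0, window=3)

quiet = [12.0, 15.0, 14.0, 13.0]  # stays well under the threshold
busy = [12.0, 28.0, 35.0, 41.0]   # the last three samples average above 30

print(should_alert(rule, quiet))
print(should_alert(rule, busy))
```

Averaging over a window rather than alerting on a single sample is one way to avoid paging on a momentary blip. The right window and threshold depend on the metric being watched.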

“With CircleCI’s previous system, it was hard for us to make alerts, hard for us to make new visualizations across old data, and impossible for us to look back at historical data. So when we got Datadog, we were suddenly publishing graphs that gave us new ways of looking at our data. It was eye opening.”

David Lowe
Backend Developer, CircleCI
