CircleCI (www.circleci.com) is a San Francisco-based startup that helps development teams ship better code, faster, through a platform for hosted Continuous Integration and Deployment. The platform gives developers fast feedback loops, which increases their productivity and gets new code to customers sooner. CircleCI’s clients include Kickstarter, Red Bull, and Shopify.
At a more granular level, CircleCI’s platform automatically and immediately tests and deploys code pushed to GitHub. “We have a very cloud-based application running on Amazon EC2,” stated David Lowe, a backend developer at CircleCI. “We have several dozen servers running. However, we scale them up and down very frequently.”
Because its infrastructure scaled so rapidly, CircleCI was growing frustrated with how poorly its existing monitoring tools could judge the health and performance of its servers, databases, and other IT components. When Lowe joined CircleCI, the team was using several monitoring tools that had been patched together. As a result, the team spent hours every week deciphering and cross-referencing the output of each tool to answer questions like “How are our queries loading?”, “When the queries are slow, are they all slow?”, or “Is this specific event random?” These extra steps made Lowe concerned about how the team was using its time. Lowe was also unhappy that CircleCI’s monitoring solution stored metrics for only two weeks. “We like to have the data long enough so if something weird happens we can see when it started, and that number always seems to be longer than two weeks ago,” stated Lowe. The only way to retain historical metrics for longer was to pull the data out and store it on a separate platform. This was unappealing since CircleCI was “trying to avoid building a monitoring solution ourselves.”
The final straw occurred when CircleCI missed an outage that should have been caught early by its monitoring system. Lowe knew then that he had “hit the limit with [their] tools” and needed to implement a more effective and sensitive monitoring solution that would scale automatically with CircleCI’s growth.
Lowe discovered Datadog through a colleague and decided to try the third-party monitoring system. A key criterion for CircleCI’s new monitoring system was that it had to integrate easily with the tools and services CircleCI was already using. Datadog did so with a number of those services, including home-grown systems that required very specific integration points. “The fact that Datadog has a nice API is huge. We got Datadog running in a couple of hours, and deployed it across our production environment,” said Lowe. “We just installed StatsD, started sending metrics, and it just worked.”
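To make “started sending metrics” concrete, here is a minimal sketch of what emitting a StatsD metric can look like. It is not CircleCI’s code: the metric names, the tag, and the helper functions are hypothetical, and it assumes a StatsD-compatible agent listening on the conventional UDP port 8125 on the local host. The wire format is the plain-text StatsD line `name:value|type`, with the optional `|#tag` suffix being the DogStatsD extension for tags.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one StatsD line: <name>:<value>|<type>, plus optional DogStatsD tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # DogStatsD extension: tags follow "|#" as a comma-separated list.
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP (fire-and-forget; no reply is expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical metrics: a counter ("c") and a timer in milliseconds ("ms").
    send_metric(format_metric("builds.started", 1, "c", tags=["env:prod"]))
    send_metric(format_metric("api.request_time", 320, "ms"))
```

Because the transport is UDP, the application does not wait for the agent to acknowledge anything, which is part of why instrumenting a service this way adds little overhead.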
In the first few days of trying Datadog, Lowe confirmed that it met all of CircleCI’s requirements without the team having to build the product themselves. “Datadog has alerting, events, and metrics all in one place,” said Lowe. This was a huge plus, since Lowe felt that other solutions were trying to treat monitoring as a multi-faceted problem. “Datadog treated it as one problem,” said Lowe, giving him and his team the ability to visualize all the data in a single pane.
Thanks to Datadog’s large set of integrations, CircleCI was able to move away from its patchwork of data visualization solutions and make the switch to Datadog. But Lowe and his team didn’t stop with data visualization. “We no longer use Nagios for alerting,” mentioned Lowe. “We use Datadog’s alarms, and then we push the data into PagerDuty.”
To handle rapidly scaling traffic, CircleCI needed to tell at a glance whether its system was performing well. According to Lowe, “Datadog gave us the ability to quickly visualize a fairly large EC2 cluster’s behavior. Visualizing the data is important because it summarizes large amounts of data in small images.” With CircleCI’s previous system, he added, “it was impossible for us to make alerts, hard for us to make new visualizations across old data, and impossible for us to look back at historical data. So when we got Datadog, we were suddenly publishing graphs that gave us new ways of looking at our data. It was eye opening.”
For example, before CircleCI switched to Datadog, the team had a known issue in which some of their API calls were slow. “But we didn’t really have a sense of which ones were slow since we had gigabytes and gigabytes of information to process. All we had before were logs, and logs aren’t good for finding patterns,” said Lowe. Because Datadog put alerts, metrics, and events all in one place, Lowe could now place a number of time series next to each other and see where the spikes and patterns occurred. Not only did this allow the team to fix the API issue, but it also revealed previously hidden problems, which they were then able to fix before customers were affected.
Going forward, Lowe and his team will evaluate the performance of new code and features by determining what metrics will be measured, and then selecting the right alerts. They will accomplish this by customizing the metrics on the screens in their office that show Datadog, which exist purely to provide constant feedback on CircleCI’s infrastructure. With metrics, alarms and events all in one Datadog dashboard, Lowe’s team has been able to quickly gain the information they need to enhance CircleCI’s testing platform.