Identify the Root Causes of Issues and Bottlenecks in Your Build Pipelines With TeamCity and Datadog | Datadog

Author Nicholas Thomson
Technical Content Writer
Author Anjali Thatte
Product Manager

Published: March 22, 2023

TeamCity is a CI/CD server that provides out-of-the-box support for unit testing, code quality tracking, and build automation. Additionally, TeamCity integrates with your other tools—such as version control, issue tracking, package repositories, and more—to simplify and expedite your CI/CD workflows.

Datadog has revamped our TeamCity integration to provide a more powerful user experience with greater visibility into your pipelines and increased control over how you monitor them. The integration includes a wealth of new metrics from your build pipelines, such as failed build count, job duration trends, and the number of builds at different stages in the pipeline (queued, processing, finished, etc.). Once you’ve enabled the integration, these metrics will start streaming into an out-of-the-box dashboard in Datadog that includes a monitor summary to alert you to high failed build counts for various build configurations. Additionally, the dashboard includes other critical telemetry, such as build-related events and logs, to give you a more holistic view of your system and multiple sources of context to draw from when you troubleshoot pipeline issues.

The unified visibility that the dashboard offers helps you understand where issues and bottlenecks in your pipelines arise—before these issues turn into more serious problems, such as failing or slow builds, that can hamper the velocity of your development team.

The TeamCity integration comes with an out-of-the-box dashboard

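In this post, we’ll show you how to drill down into build status alerts to detect issues with your code deployments, eliminate bottlenecks at different stages of the pipeline, and correlate build events and logs with performance metrics to troubleshoot more effectively.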

Drill down into build status alerts to detect issues with your code deployments

Say you are an engineer on an e-commerce site, where you are part of an agile team that uses TeamCity to manage CI/CD. You are deploying a new piece of code to update the site’s checkout experience, and you want to ensure the update does not cause any other parts of the existing code base to break. However, when you push a commit, it triggers a failed build. To investigate why the build failed, you navigate to the out-of-the-box TeamCity dashboard in Datadog.

The Overview widget shows you service checks alongside pipeline alerts

The Overview widget gives you a high-level view of your pipelines by displaying all active pipeline alerts alongside service checks. A high number of build status alerts have fired in the past day, indicating that there might be a bigger issue beyond the scope of your individual build. You can click on the custom monitor alert for your build to view more detail on the alert status.
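To make the idea of a failed build count alert concrete, here is a minimal Python sketch of the underlying logic: count failures per build configuration and flag any configuration that exceeds a threshold. The record fields (`build_config`, `status`), the configuration names, and the threshold are all assumptions made for illustration. They are not the integration's schema, and Datadog monitors evaluate metric queries rather than running code like this.

```python
from collections import Counter

# Hypothetical build records; field names and values are illustrative only.
builds = [
    {"build_config": "CheckoutService_Build", "status": "FAILURE"},
    {"build_config": "CheckoutService_Build", "status": "FAILURE"},
    {"build_config": "CheckoutService_Build", "status": "SUCCESS"},
    {"build_config": "CartService_Build", "status": "SUCCESS"},
    {"build_config": "CartService_Build", "status": "FAILURE"},
]


def failed_build_alerts(builds, threshold):
    """Return build configurations whose failed build count exceeds the threshold."""
    failed = Counter(
        b["build_config"] for b in builds if b["status"] == "FAILURE"
    )
    return {config: count for config, count in failed.items() if count > threshold}


# Flag any configuration with more than one failed build in the sample.
print(failed_build_alerts(builds, threshold=1))
```

With the sample data, only the checkout build configuration crosses the threshold, which mirrors the scenario above where one build stands out among many alerts.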

Click on an alert to view details

Because you’ve set up tracing on your TeamCity pipelines with Datadog CI Visibility, you can check the pipeline in question for more information on why so many builds are failing. You click on the red monitor summary, which brings up a list of the builds in your pipelines that have triggered alerts.

View the builds in your pipeline that are triggering alerts

From there, you select the Checkout Service Build, which brings you to the CI Visibility page. Here you can drill into a flame graph visualization of the build and find a job that is returning an error. Then, you can select the Errors tab to dive into the error message, which will likely give you some insight into the issue—for example, that the job is erroring out because of a typo in a recent code deployment. In this hypothetical scenario, it’s plausible that the typo would be the root of the elevated number of build failures. And with this type of information in hand, you could remediate the underlying issue.

Eliminate bottlenecks at different stages of the pipeline

In addition to the out-of-the-box monitors that come with the integration, Datadog also enables you to set custom monitors on your TeamCity pipelines so that you’re alerted when issues specific to your application occur. To continue our example from above, say you’re an SRE at an e-commerce site and you get an alert that latency is high on one of your pipelines. You look at the TeamCity dashboard to investigate and notice that more builds than expected are waiting in the queue at this point in the pipeline. Relatedly, fewer builds than expected are currently running, and the Builds widget shows you that job duration for one of your build configurations spiked between 12:00 and 12:30.

The builds widget gives you a high-level view of your build pipeline

Taking these pieces of evidence together, you hypothesize that one of your build configurations is taking too long to complete, causing a bottleneck that is preventing other builds from moving forward in the queue. Let’s say in this case that you check the logs section of the TeamCity dashboard and discover a spike in error logs. Let’s also suppose that you open one of these logs to investigate, and the log message tells you that this particular operation is timing out because it is failing to connect to an external library due to a lack of permission caused by an expired API key. Armed with this knowledge, you could update the API key so that the build completes successfully and on time again.
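The reasoning behind spotting a job duration spike can be sketched in a few lines of Python. This example compares each build configuration's latest duration against the median of its earlier runs. The configuration names, the durations, and the 2x factor are hypothetical values chosen for illustration, not values taken from the integration.

```python
from statistics import median


def duration_spikes(durations_by_config, factor=2.0):
    """Flag configurations whose latest duration exceeds factor x the median of earlier runs."""
    spikes = {}
    for config, durations in durations_by_config.items():
        if len(durations) < 2:
            continue  # not enough history to compare against
        *history, latest = durations
        baseline = median(history)
        if baseline > 0 and latest > factor * baseline:
            spikes[config] = (baseline, latest)
    return spikes


# Hypothetical job durations in seconds, oldest first.
durations_by_config = {
    "CheckoutService_Build": [310, 295, 320, 1240],
    "CartService_Build": [180, 175, 190, 185],
}

print(duration_spikes(durations_by_config))
```

A median baseline is used here because a single earlier slow run skews it less than a mean would. In practice, a custom monitor on the job duration metric would play this role.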

Correlate build events and logs with performance metrics to more effectively troubleshoot

The TeamCity integration enables you to see logs and events from your CI/CD pipelines in the same pane of glass as your metrics, so you can more effectively troubleshoot problems you’ve identified through trends in your data. TeamCity logs and events appear in the Logs and Events widget in the dashboard. The log stream shows you the number of logs per type (info, error, OK, warn, etc.) so that you can see at a glance which types of events are being generated by your TeamCity pipelines. The log metrics inform you of issues in your pipelines with a quick visualization comparing the rates of error and warn logs to info and ok logs, and the events stream helps you debug issues by providing detailed context about the associated events.
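As a rough illustration of the comparison the log metrics make, the following Python sketch computes the share of logs with an error or warn status relative to all error, warn, info, and ok logs. The status strings and sample data are assumptions for the example. The dashboard widget derives this from your indexed logs, not from code like this.

```python
from collections import Counter


def problem_log_ratio(log_statuses):
    """Return the share of logs whose status is error or warn."""
    counts = Counter(status.lower() for status in log_statuses)
    problem = counts["error"] + counts["warn"]
    healthy = counts["info"] + counts["ok"]
    total = problem + healthy
    # Avoid dividing by zero when no matching logs were seen.
    return problem / total if total else 0.0


# Hypothetical log statuses collected over some time window.
statuses = ["info", "ok", "error", "warn", "info", "error", "info", "ok"]

print(problem_log_ratio(statuses))
```

A sudden rise in this ratio is the kind of signal that would prompt you to open the event stream for more context.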

View a breakdown of the most common log types in your system alongside the event stream

For instance, say you see a spike in error logs. This likely signals a problem that you want to investigate further. You turn to the event stream and find that a large number of failed build events are coming from a single Kubernetes cluster.
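The step of tracing failed build events back to a single cluster amounts to grouping events by a tag. Here is a minimal Python sketch of that grouping. The event shape, the `build_failed` type, and the `kube_cluster_name` and `host` tag values are assumptions for illustration and may not match how your events are actually tagged.

```python
from collections import Counter


def failures_by_tag(events, tag_key):
    """Count failed build events per value of the given tag, most frequent first."""
    counts = Counter()
    for event in events:
        if event["type"] != "build_failed":
            continue
        for tag in event["tags"]:
            key, _, value = tag.partition(":")
            if key == tag_key:
                counts[value] += 1
    return counts.most_common()


# Hypothetical events; the type names and tags are illustrative only.
events = [
    {"type": "build_failed", "tags": ["kube_cluster_name:ci-east", "host:node-1"]},
    {"type": "build_failed", "tags": ["kube_cluster_name:ci-east", "host:node-2"]},
    {"type": "build_successful", "tags": ["kube_cluster_name:ci-west", "host:node-7"]},
    {"type": "build_failed", "tags": ["kube_cluster_name:ci-east", "host:node-1"]},
    {"type": "build_failed", "tags": ["kube_cluster_name:ci-west", "host:node-7"]},
]

print(failures_by_tag(events, "kube_cluster_name"))
```

Passing `"host"` instead of `"kube_cluster_name"` groups the same failures by node, which is the pivot the Infrastructure tab makes available from an event.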

Easily pivot to the event stream to investigate issues further

Because the TeamCity events are tagged with the host name, you can click the Infrastructure tab on the event to view the host, then click on the host dashboard to view a detailed breakdown of the cluster’s health. Let’s say the Network Traffic widget shows you that communication in the cluster flatlined earlier in the day. You might then check logs from the cluster and find that your cluster’s network policy was updated right before the nodes in this cluster started to lose networking. With this likely cause in mind, you could then roll back the policy update until the appropriate team could troubleshoot it, bringing the nodes back online and once again enabling them to host your pipeline builds without issue.

Leverage deeper visibility to identify bottlenecks within your pipelines over time

The TeamCity integration in Datadog provides full visibility into build pipelines and system health metrics, including job duration trends, the number of builds at different stages of the pipeline, and overall system resource allocation patterns. This data allows you to identify bottlenecks within your pipelines and more efficiently troubleshoot issues as they arise. These metrics are brought together in the out-of-the-box dashboard with a range of key telemetry to help you root-cause and remediate problems affecting your pipelines.

If you’re new to Datadog, sign up for a 14-day free trial and see firsthand how you can take advantage of the TeamCity integration today.