Find what's driving errors and latency with Tag Analysis

Antoine Dussault

When a spike in latency or errors threatens your service’s reliability, identifying the root cause quickly is essential—but not always straightforward. You can group your trace data by tags to see how attributes (such as your service's version or the data center where it's running) correlate with its performance. But your grouping strategy is only an educated guess—you have to define the grouping before you know which dimensions contribute to the performance variations you're investigating, and this trial-and-error workflow can slow down your investigation.
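Of course, those attributes are only available for grouping (and for Tag Analysis) if your spans actually carry them. As a minimal sketch, assuming a Python service instrumented with Datadog's ddtrace library and borrowing the tag names from this post's example, you might attach them when creating a span:

from ddtrace import tracer

# Create a span for a checkout operation and attach the attributes we may
# later want to group or compare by. The tag names below
# (locationservice.version, datacenter) are illustrative, not prescribed.
with tracer.trace("checkout", service="web-store", resource="ShoppingCartController") as span:
    span.set_tag("locationservice.version", "v2")
    span.set_tag("datacenter", "us-east-1")
    # ... handle the request ...

In practice, attributes like these often come from unified service tagging or your deployment tooling rather than hand-written set_tag calls; the point is simply that any dimension you want to analyze has to exist as a tag on your spans.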

Datadog Tag Analysis removes this guesswork by automatically identifying the tags that best explain increases in latency or errors. This means that you can easily see which tags are statistically significant to the issue you're investigating and group those spans with a single click to home in on the root cause.

In this post, we'll explore how Tag Analysis helps you quickly identify the drivers of latency and errors in your services.

Surface latency drivers automatically

Latency can stem from various issues, but the result is the same: degraded user experience and potential losses in customer retention and revenue. Identifying the cause quickly is critical, but sifting through high-cardinality tags to isolate the culprit can slow down your investigation. Tag Analysis simplifies this process by automatically identifying which tags correlate most strongly with a spike in latency. Say you receive an alert notifying you of increasing latency on your service. You can navigate to the Service Page for a high-level view of service performance. You'll see RED metrics (rate, errors, and duration) that confirm the rising latency, and you can select the problematic spans to launch Tag Analysis and see which tags are correlated with the slowdown.

Tag Analysis shows you a scatter plot of span durations that highlights how recent spans differ from the baseline, which is established by the spans that represent your service's typical performance. Additionally, you'll see the top correlated tags and their relative latency across the high-latency and baseline groups. In the following screenshot, Tag Analysis shows that spans tagged v2 are consistently slower than those tagged v1.

Tag Analysis comparing high-latency spans to baseline in the web-store service. The scatter plot shows latency distribution, with location service version v2 as overrepresented in high-latency spans by 36.3 percent. Account groups such as beta and batch_1 also show elevated presence in the subset.

These correlations can help you better understand the scope of the problem, in this case suggesting that the latency increase could be related to the recent deployment of a new version of the service. Tag Analysis also shows that the account.group tag is strongly correlated with the difference in latency. The beta group—the first users of the newly released version—is overrepresented in the high-latency spans. With this insight, you can avoid a wider impact by halting the rollout before it hits more users. And for a closer look at how the locationservice.version attribute contributes to the latency, you can easily group by relevant tags—for example, to compare the v1 version's latency against that of v2.
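For example (an illustrative query rather than an exact recipe), you could filter the Trace Explorer to service:web-store and the ShoppingCartController resource, then group the results by the @locationservice.version facet to compare request counts and p95 latency for v1 and v2 side by side, as the next screenshot shows.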

Trace Explorer showing grouped span data by locationservice.version for the ShoppingCartController endpoint in the web-store service. Table displays request counts and P95 latency, with version v2 handling fewer requests but showing a higher latency of 3.28 seconds compared to 1.05 seconds for v1.

Understand error spikes without guesswork

Errors are another key driver of poor user experience, and it's crucial to watch for them and investigate as soon as they arise—whether via dashboards, monitors, or Error Tracking. To find the scope and cause of the errors you're investigating, you can group error spans by tag, but manually cycling through attributes to find a correlation is slow and inefficient. Now, you can launch Tag Analysis with a single click to automatically surface the tags most strongly correlated with the spike in errors.
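If you do want to check a hunch manually, a rough sketch of that workflow (using Datadog's span search syntax, with the tag from this post's example) is to filter to the error spans with a query like service:web-store status:error and then group them by a candidate facet such as @account.id. Tag Analysis effectively runs that comparison across many tags at once and ranks the results for you.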

Tag Analysis provides a clear, data-driven direction for your investigation. The ranked list of correlated attributes shows how frequently each tag’s values appear in both the error subset and the baseline, helping you see what the affected spans have in common—and where to focus your investigation next.

In the following screenshot, the top-ranked dimension is account.id, suggesting that the errors affect only a specific segment of customers.

Datadog APM Tag Analysis comparing error and OK spans for the ShoppingCartController endpoint in the web-store service.

Troubleshooting an increase in errors becomes more manageable once you learn that most of the error spans originated from a specific customer category. This information can guide your investigation, enabling you to quickly rule out irrelevant variables and focus on the attributes that most likely contributed to the errors. For example, recognizing that the bulk of errors are associated with certain customers suggests that you should investigate whether the errors relate to malformed requests, expired credentials, integration changes, or other issues on the client side.

Zero in on high-impact tags with Tag Analysis

By automatically identifying the tags that are correlated with an increase in errors or latency, Tag Analysis helps you quickly understand the root cause. These automated insights surface what’s driving the issue so that you can avoid trial-and-error investigations and focus your mitigation efforts where they matter most.

Tag Analysis is now available in Preview. You can sign up here and see the Tag Analysis documentation to get started. To learn more about how to troubleshoot application performance, see the Datadog APM documentation. If you don’t already have a Datadog account, you can sign up for a free trial to get started.
