Highly distributed applications can rely on hundreds of services and infrastructure components across multiple cloud-based and on-premises environments, which makes it challenging to identify problems and pinpoint their origin. Even if you already have robust monitoring and alerts, your infrastructure and applications will likely change over time, which can make it difficult to reliably detect irregular behavior. To meet this challenge, we developed our Trace Outliers feature.
Trace Outliers watches the analyzed spans returned by your App Analytics queries in real time and scans all available tags in the results. It automatically detects which tagged objects from your environment (users, hosts, services, etc.) are associated with higher-than-usual error rates. This feature enables DevOps teams to quickly identify issues and dig deeper into the relevant traces and associated services, infrastructure components, and profiles to discover possible root causes.
Datadog’s Trace Outliers feature is embedded in our App Analytics UI, so you can investigate analyzed spans and outlier data in a single view. Simply navigate to the App Analytics view in your account, and use tags like env to filter your spans. If a notification appears in the sidebar (highlighted below), click on it to see any outlier behavior and a list of tags that appear in analyzed spans exhibiting that behavior.
Clicking on any of the listed tags lets you drill down into its traces and view other associated tags for wider context. Once you’ve identified where errors are originating, you can start taking steps to resolve the issue. For example, you can check the health metrics of the service throwing the errors to see whether a bottleneck is responsible and decide whether to provision more resources.
Even with comprehensive monitoring and a robust set of alerts, you can still encounter challenges spotting issues, knowing who those issues affect, and identifying their root cause. Trace Outliers helps discover unexpected and erroneous behavior in your applications and infrastructure.
One common hurdle you’ll encounter while managing a multi-tenant application is knowing which infrastructure components are underperforming and which customers might be affected. If a customer says they’re experiencing issues with an application, identifying the exact resources the customer is using, which might be provisioned across thousands of hosts, partitions, and shards, can be difficult and time-consuming. In App Analytics, analyzed spans can carry tags that associate them with both the underlying infrastructure components and the users. This way, you can easily see the user whose request generated the analyzed span and the resources behind it.
Trace Outliers scans analyzed spans for error patterns, surfacing any tags attached to a disproportionate number of spans from traces ending in errors. For further convenience, Trace Outliers groups tags that often appear together on the same erroring traces. This means that if Trace Outliers detects that a specific tagged infrastructure component is correlated with unusually high error rates, it will quickly identify tagged users with similar error rates that are also associated with that component, so you can immediately know where to focus your efforts.
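To make the idea of a tag being attached to a disproportionate number of erroring spans more concrete, here is a minimal Python sketch. It is not Datadog’s implementation, which this post does not describe. The span layout (a dict with "tags" and "error" keys), the function name error_rate_outliers, and the min_spans and ratio thresholds are all assumptions chosen for illustration.

```python
from collections import Counter


def error_rate_outliers(spans, min_spans=5, ratio=2.0):
    """Return (tag, error_rate) pairs whose error rate is at least
    `ratio` times the overall error rate (hypothetical heuristic)."""
    total = len(spans)
    errors = sum(1 for span in spans if span["error"])
    if total == 0 or errors == 0:
        return []
    baseline = errors / total

    seen = Counter()    # spans carrying each (key, value) tag
    failed = Counter()  # erroring spans carrying each tag
    for span in spans:
        for tag in span["tags"].items():
            seen[tag] += 1
            if span["error"]:
                failed[tag] += 1

    outliers = []
    for tag, count in seen.items():
        if count < min_spans:
            continue  # too few spans to judge this tag
        rate = failed[tag] / count
        if rate >= ratio * baseline:
            outliers.append((tag, rate))
    # Highest error rate first
    return sorted(outliers, key=lambda item: item[1], reverse=True)


# Made-up sample data: one host and customer fail far more often than the rest.
spans = (
    [{"tags": {"host": "h1", "customer": "acme"}, "error": True}] * 8
    + [{"tags": {"host": "h1", "customer": "acme"}, "error": False}] * 2
    + [{"tags": {"host": "h2", "customer": "globex"}, "error": True}] * 1
    + [{"tags": {"host": "h2", "customer": "globex"}, "error": False}] * 29
)

for tag, rate in error_rate_outliers(spans):
    print(tag, rate)
```

With this sample data, the overall error rate is 9 out of 40 spans (22.5%), while the host h1 and customer acme tags each sit at 80%, so both are flagged. Because the two tags come from the same erroring spans, they would also be natural candidates to group together.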
Whether identifying the resources behind a customer’s issue, or finding customers affected by underperforming infrastructure, Trace Outliers will save you time and effort by quickly identifying which elements have similar error rate patterns.
Feature flags allow engineering teams to safely test new features and deliver additional functionality to their users. When you enable a feature, you’ll want to know quickly if it’s behaving as expected or if it’s affecting application performance. Trace Outliers can immediately identify whether analyzed spans tagged with a new feature flag are showing performance degradation. This lets you quickly course-correct and roll back the feature or troubleshoot as needed.
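As a rough illustration of the feature flag case, the sketch below compares the error rate of spans that carry a flag tag against all other spans. It assumes the same hypothetical span layout (dicts with "tags" and "error" keys); the tag name new_checkout and the "true" value are invented for the example, and this is not how Trace Outliers itself is implemented.

```python
def flag_error_rates(spans, flag_name):
    """Return (error rate with flag enabled, error rate for all other spans)."""

    def rate(group):
        # An empty group has no errors to report
        if not group:
            return 0.0
        return sum(1 for span in group if span["error"]) / len(group)

    enabled = [s for s in spans if s["tags"].get(flag_name) == "true"]
    others = [s for s in spans if s["tags"].get(flag_name) != "true"]
    return rate(enabled), rate(others)


# Made-up sample data: 20 spans behind the flag, 80 without it.
flag_spans = (
    [{"tags": {"new_checkout": "true"}, "error": True}] * 5
    + [{"tags": {"new_checkout": "true"}, "error": False}] * 15
    + [{"tags": {}, "error": True}] * 4
    + [{"tags": {}, "error": False}] * 76
)

with_flag, without_flag = flag_error_rates(flag_spans, "new_checkout")
print(f"with flag: {with_flag:.0%}, without flag: {without_flag:.0%}")
```

In this sample, 5 of 20 flagged spans fail compared with 4 of 80 unflagged spans, which is the kind of gap that would suggest rolling the feature back or investigating further.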
Our Trace Outliers feature is powered by artificial intelligence for IT operations (AIOps), which makes it easy for teams to spot unusual behavior and pinpoint the origin of an issue in their applications and more than 400 integrations. Trace Outliers is one of Datadog’s many AIOps-driven features, along with Watchdog, Metric Correlations, and Log Patterns. By quickly identifying abnormal behavior, these features save you time detecting potential issues, so you can start troubleshooting sooner and reduce MTTR.
If you are not already using Datadog, sign up today for a 14-day free trial.