Automated Root Cause Analysis With Watchdog RCA | Datadog

Author: Othmane Abou-Amal

Published: January 5, 2021

Since 2018, Watchdog has provided automatic, machine learning-based anomaly detection to notify you of performance issues in your applications. Earlier this year, Watchdog started grouping APM anomalies across your services, allowing you to better understand the scope of an issue. Today, we’re pleased to announce the private beta of Watchdog RCA, which automatically identifies causal relationships between different symptoms across your applications and infrastructure and pinpoints the root cause. This hands-free approach to root cause analysis enables you to resolve problems faster than ever, significantly reducing your mean time to resolution (MTTR).

Automatically visualize the root cause of performance anomalies.

Automatically identify the root cause of any Watchdog story

When Watchdog detects an anomaly in your environment, it adds a “story” to your feed, which includes a graph scoped to the period of interest, as well as other contextual information such as stack traces, error messages, and affected services. Watchdog RCA builds on this existing functionality by mapping your applications and infrastructure and learning how these different components interact over time. As new anomalies are detected and your Datadog instrumentation gets deeper, Watchdog is able to use its knowledge of your system to identify the root cause of a wide range of issues, including ones that may be impacting your SLIs. For example:

  • If a service is querying a database more frequently than usual and driving up latency across many other services as a result, Watchdog will enable you to see the increased load—and identify the offending service and resource.

  • If a new service deployment has introduced an increase in latency that is propagating through three other services and causing timeouts in your checkout endpoint, Watchdog RCA will identify the offending deployment, show the full scope of impacted services, and enable you to roll back in a heartbeat.

Automatically identify the code version responsible for increased latency.

  • If your web app’s search endpoint is returning errors because of a spike in errors in your Elasticsearch cluster, a full disk on one of its hosts could be at fault. Watchdog RCA identifies the exact host that is causing the issue and highlights the offending disk metric, so you can take swift corrective action.

This is only a small sampling of the types of issues for which Watchdog can isolate the root cause, and we are constantly iterating to expand its reach. And because Watchdog RCA is available for stories that appear in the Watchdog feed, on the APM homepage, and on your Service and Resource pages, you’ll have access to comprehensive contextual information, no matter where you enter a particular story.
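The scenarios above share a pattern: an anomaly in one component spreads to the components that depend on it. As a rough intuition for how a root cause can be separated from its symptoms, here is a minimal Python sketch. It is a toy illustration, not Watchdog RCA's actual algorithm, which this post does not describe. The dependency map, the service names, and the functions `root_cause_candidates` and `impacted_by` are all hypothetical. The sketch assumes a root cause is an anomalous component whose own direct dependencies look healthy.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "checkout": ["cart", "payments"],
    "cart": ["inventory"],
    "payments": ["inventory"],
    "inventory": ["orders-db"],
    "search": ["elasticsearch"],
    "orders-db": [],
    "elasticsearch": [],
}


def root_cause_candidates(depends_on, anomalous):
    """Return anomalous components none of whose direct dependencies are anomalous."""
    return sorted(
        node
        for node in anomalous
        if not any(dep in anomalous for dep in depends_on.get(node, []))
    )


def impacted_by(depends_on, root, anomalous):
    """Return anomalous components that reach `root` through anomalous callers."""
    # Invert the edges so we can walk from the root toward its callers.
    callers = {}
    for node, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(node)
    seen, queue = set(), deque([root])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller in anomalous and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


# Toy input: a slow deployment of "inventory" has spread to three other services.
anomalous = {"checkout", "cart", "payments", "inventory"}
roots = root_cause_candidates(DEPENDS_ON, anomalous)
scope = impacted_by(DEPENDS_ON, roots[0], anomalous)
print(roots)  # ['inventory']
print(scope)  # ['cart', 'checkout', 'payments']
```

A real system would also need to weigh timing, deployment events, and infrastructure metrics. The sketch only shows why a map of how components interact makes it possible to tell the origin of a problem from its downstream effects.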

Connect Watchdog stories with alerts from user-defined monitors

Earlier this year, Watchdog started clustering related alerts together, helping you see the connections between different issues in your environment and reducing alert fatigue. Watchdog RCA takes this unified approach to alerting a step further by placing alerts from the monitors you’ve defined yourself within the details page for Watchdog stories, as shown in the screenshot below.

View alerts from user-defined monitors within the details page for Watchdog stories.

Additionally, if you receive an alert from one of your user-defined monitors and Watchdog links it to an existing story with an identified root cause, the root cause will be exposed directly on the Monitor page.

View the root cause for an issue that triggered a user-defined monitor at the top of the Monitor page.

By linking user-defined monitors and Watchdog stories across both areas of the platform, we’ve ensured that you will always have all of the available information for the issue you’re monitoring, without having to switch contexts.
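To make the idea of linking an alert to a story concrete, here is a minimal Python sketch. It assumes, purely for illustration, that an alert matches a story when it fires on one of the story's affected services during (or shortly around) the story's time window. The post does not say how Watchdog performs this linking. The `Story` and `Alert` classes, the `root_cause_for` function, and the `slack` parameter are hypothetical and do not reflect Datadog's data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Story:
    # Hypothetical stand-in for a Watchdog story; not Datadog's data model.
    services: frozenset
    start: datetime
    end: datetime
    root_cause: Optional[str] = None


@dataclass
class Alert:
    # Hypothetical stand-in for an alert from a user-defined monitor.
    monitor: str
    service: str
    triggered_at: datetime


def root_cause_for(alert, stories, slack=timedelta(minutes=5)):
    """Return the root cause of the first story matching the alert, else None."""
    for story in stories:
        same_service = alert.service in story.services
        in_window = story.start - slack <= alert.triggered_at <= story.end + slack
        if same_service and in_window and story.root_cause is not None:
            return story.root_cause
    return None


story = Story(
    services=frozenset({"web-app", "elasticsearch"}),
    start=datetime(2021, 1, 5, 10, 0),
    end=datetime(2021, 1, 5, 10, 30),
    root_cause="disk full on an elasticsearch host",
)
alert = Alert("search error rate", "web-app", datetime(2021, 1, 5, 10, 12))
print(root_cause_for(alert, [story]))  # disk full on an elasticsearch host
```

The `slack` window is there because a monitor may evaluate slightly before or after the story's detected time range. The five-minute default is an arbitrary choice for this sketch.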

Automate your root cause analysis

By observing your applications and infrastructure, Watchdog is able to automatically detect anomalies, group related anomalies across services, and, with the addition of Watchdog RCA, pinpoint the root cause. Watchdog RCA will become even more sophisticated over the coming months by incorporating more datasets, such as the state of your Kubernetes or Docker containerized environments, CloudWatch metrics, and Real User Monitoring events, so you can identify more issues across your stack faster than ever before.

Watchdog RCA is now in private beta. If you’d like to register for access, sign up here. You can also check out Watchdog Insights, which speeds up your investigation workflows by suggesting possible issues in data such as traces and logs.