Watchdog is Datadog’s machine learning and AI engine, which uses algorithms such as anomaly detection to automatically surface performance issues in your infrastructure and applications. Without any manual setup or configuration, Watchdog generates a feed of Alerts on anomalies such as latency spikes, elevated error rates, and network issues in cloud providers, helping you reduce your mean time to detection.
Watchdog now includes Impact Analysis, which shows you how performance issues in your application may be impacting users. Whenever Watchdog finds a new APM anomaly, it automatically assesses whether the anomaly is adversely affecting any web or mobile pages that are instrumented with RUM by analyzing a variety of latency and error metrics submitted from the RUM SDKs. If Watchdog finds that the issue is impacting end users, it provides information about which view paths and users were affected.
At a glance, Watchdog Impact Analysis helps you determine the scope of a performance issue in a service, including which parts of your application are impacted and which users are affected. In this post, we’ll show you how you can use Impact Analysis to:
- Quickly prioritize which issues to troubleshoot
- Reduce business impact by understanding which users were affected
- Prevent similar issues from happening in the future by creating detailed postmortems
While Watchdog often surfaces actionable issues within your applications, it may also find anomalies that don’t require immediate intervention. It can be difficult to distinguish between these two cases, especially if you aren’t familiar with the particular services involved.
Watchdog Impact Analysis addresses this challenge by clearly showing you which application performance anomalies are impacting your users. Whenever Watchdog determines that an issue is associated with user impact, it will prominently display details in the “Impacts” section.
For example, let’s say that you see two Watchdog Alerts at the top of your feed. The first tells you that latency is up on the product recommendation service, but the Impact Analysis indicates that this issue is only affecting a few customers.
The second Alert informs you of a faulty deployment of your address service, and Impact Analysis shows you that this issue is affecting significantly more customers and thus should be dealt with first.
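If you note the affected-user counts from each Alert by hand, the prioritization in this example comes down to a simple sort. The following Python sketch is illustrative only: the `AlertImpact` record, the service names, and the user counts are hypothetical placeholders and do not come from a Datadog API.

```python
from dataclasses import dataclass


@dataclass
class AlertImpact:
    """Hypothetical record of one Alert's impact; not a Datadog API object."""
    service: str
    summary: str
    affected_users: int


def prioritize(alerts):
    """Return alerts ordered from most to fewest affected users."""
    return sorted(alerts, key=lambda a: a.affected_users, reverse=True)


# Made-up counts standing in for "a few customers" and "significantly more".
alerts = [
    AlertImpact("product-recommendation", "latency increase", affected_users=4),
    AlertImpact("address", "faulty deployment", affected_users=250),
]

for alert in prioritize(alerts):
    print(f"{alert.service}: {alert.summary} ({alert.affected_users} users affected)")
```

With these placeholder numbers, the address service deployment sorts to the top, matching the order in which you would troubleshoot the two Alerts.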
Once you’ve pinpointed the most important issue, you can easily troubleshoot by clicking on the Alert to see RUM data and APM traces. Whenever a Watchdog APM Alert involves multiple services, the Alert will include a dependency map that illustrates the spread of the performance issue through your application. If Watchdog is able to determine the root cause of the issue, it will be highlighted on the dependency map.
In addition to helping you troubleshoot issues more effectively, Impact Analysis also provides you with a list of users that were potentially affected by the disruption.
In this example, the dropdown list shows that out of 1.22k total users who visited the impacted pages during the anomaly window, 183 experienced degraded performance. You can click to see the affected users’ contact information if you need to reach out to them. Additionally, you can click on the view paths pill to see Session Replays that show the impact from the perspective of users who actually experienced the disruption. Seeing what impacted users saw makes it even easier to assess whether the issue is critical.
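To put these counts in perspective, you can express them as a share of visitors. The sketch below is a minimal Python example that assumes 1.22k corresponds to about 1,220 users (the displayed figure is rounded); the function name `affected_share` is our own and is not part of any Datadog tooling.

```python
def affected_share(affected_users: int, total_users: int) -> float:
    """Return the fraction of visiting users who experienced degraded performance."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return affected_users / total_users


# 183 affected users out of roughly 1,220 visitors (1.22k, rounded in the UI).
share = affected_share(183, 1220)
print(f"{share:.1%}")  # prints 15.0%
```

Roughly 15 percent of the users who visited the impacted pages saw degraded performance, which is the kind of figure you might record when documenting an incident.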
All postmortems should have an “impacts” section, where you can document an incident’s impact on users. You can easily export Impact Analysis data to a Datadog Notebook and supplement it with live graphs and other visualizations. This allows you to quickly create postmortems and follow best practices by centralizing all your data in a single place as you investigate user-facing incidents. Notebooks are also fully editable, so everyone on your team can leave comments and contribute additional data throughout the incident response process.
Watchdog Impact Analysis is now automatically enabled for all Datadog APM and RUM users, so you can immediately get deeper visibility into the real-world impact of service performance issues, easily prioritize troubleshooting efforts, and streamline the postmortem process. See our documentation to learn more. If you’re new to Datadog, sign up for a 14-day free trial.