Speed up your root cause analysis with Metric Correlations

Author: Lior Belenki

Published: December 20, 2019

In a world where the applications we run are constantly changing, the number of monitored metrics and events is skyrocketing, and responsibility for system components is fragmented across teams, it becomes increasingly difficult to pinpoint possible root causes of an issue in a timely manner. To address this challenge, we’re introducing Metric Correlations, which automatically finds candidates for the causes of an issue by searching your system for correlated metrics.

When you notice an abnormal change in a metric, Metric Correlations searches for irregularities in other metrics over the corresponding time period. Rather than manually browsing through dashboards and plotting metrics to discern trends, you can let Datadog provide clues automatically for more efficient root cause analysis.
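To make the idea concrete, here is a minimal sketch of what "searching for irregularities in other metrics over the corresponding time period" could look like. It is not Datadog's actual algorithm, which this post does not describe. It assumes plain Pearson correlation as the similarity measure, and the function names, metric names, and data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_correlated(target, candidates, window, top_n=3):
    """Rank candidate metrics by |correlation| with the target in a window.

    `window` is a (start, end) pair of list indexes, end exclusive.
    This is an illustrative stand-in, not Datadog's implementation.
    """
    start, end = window
    target_slice = target[start:end]
    scored = [
        (name, abs(pearson(target_slice, series[start:end])))
        for name, series in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical data: completed checkouts drop while database latency spikes.
checkouts = [10, 10, 10, 10, 2, 2, 2, 10]
candidates = {
    "mongo.latency": [5, 5, 5, 5, 50, 50, 50, 5],
    "cpu.idle": [1, 2, 1, 2, 1, 2, 1, 2],
}
ranked = rank_correlated(checkouts, candidates, window=(0, 8))
```

Taking the absolute value means a metric that moves in the opposite direction (latency rising as checkouts fall) ranks just as high as one that moves in the same direction.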

Start your investigation on the right foot

You can launch Metric Correlations from multiple entry points, including dashboards, monitors, and notebooks. Another entry point is Watchdog, which automatically detects abnormal trends within a metric—this means that Datadog can both surface potential problems and guide your root cause analysis. Any time you notice—or get notified about—an irregular change in a metric, you can easily get leads for your investigation.

In the example above, Watchdog has detected an unexpected increase in latency for customers applying coupons to a shopping cart in an online store. We can run Metric Correlations from within the Watchdog story to investigate the lag.

Understand the full extent of a problem

Metric Correlations helps you discover the full scope of a problem and its side effects, so you can quickly find the path toward remediation. For example, in the dashboard below, we want to know why the number of completed checkouts in our online store has dropped.

We can run Metric Correlations right from the graph, and it will scan thousands of metrics from different sources, such as APM services, dashboards, and integrations.

Metric Correlations groups results by source to help you see what components in your system might be involved in an issue. You can get more information about each result by hovering over it. Once you know which other sources could be part of the issue, you can drill down into the results, starting from the source you think is most likely to be related to the issue.

In the example above, we see abnormal behavior in the “Checkout Funnel Tracking” dashboard. If we click on that group, we can see a more detailed view of the correlated results.

Now we learn that the percentage of abandoned shopping carts spiked around the same time, as did the amount of time spent “before checkout” on the website (waiting to check out).

From the sidebar, we can easily navigate to view correlations found from other sources (e.g., APM services, dashboards, or integrations). For example, we can click on “web-store-mongo” to view metric correlations from the service that pulls added cart items from the MongoDB data store.

We can now see a correlated spike in our web-store-mongo service’s request latency metric. This helps explain why shoppers spent more time waiting for their checkouts to process—which ultimately led to a higher number of abandoned shopping carts.

Focus your investigation

By default, Metric Correlations will automatically define an area of interest—the earliest and latest times it will search—based on abnormal values of the metric you’ve selected. You can adjust the area of interest by dragging the handles within the graph.
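As an illustration of how an area of interest might be derived from abnormal values, the sketch below flags points that sit far from the series median and returns the first and last flagged positions. The median-absolute-deviation rule, the threshold `k`, and the function name are assumptions chosen for the example, not the method Metric Correlations uses.

```python
from statistics import median


def area_of_interest(values, k=3.0):
    """Return (first, last) indexes of abnormal points, or None if none.

    A point counts as abnormal when it is more than k median absolute
    deviations (MAD) from the median. Hypothetical rule for illustration.
    """
    baseline = median(values)
    mad = median([abs(v - baseline) for v in values])
    if mad == 0:
        # Series is mostly flat: treat any departure from the baseline as abnormal.
        abnormal = [i for i, v in enumerate(values) if v != baseline]
    else:
        abnormal = [i for i, v in enumerate(values) if abs(v - baseline) > k * mad]
    if not abnormal:
        return None
    return abnormal[0], abnormal[-1]


# Hypothetical checkout counts with a dip in the middle.
checkouts = [100, 101, 99, 100, 40, 38, 42, 100, 101]
window = area_of_interest(checkouts)
```

Dragging the handles in the graph corresponds to overriding the returned pair with a window of your own.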

You can also tailor your Metric Correlations search to include custom metrics in addition to the default sources, giving Metric Correlations comprehensive reach within your system. And if you want to investigate a specific environment, service, or part of your infrastructure, you can scope your search to metrics that include a tag of your choice.
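Scoping by tag amounts to filtering the candidate set before the search runs. The sketch below assumes each metric is represented as a dictionary with a name and a set of `key:value` tag strings. That in-memory shape, the function name, and the sample metrics are hypothetical and do not reflect a Datadog API.

```python
def scope_by_tag(metrics, tag):
    """Keep only the names of metrics that carry the given key:value tag."""
    return [m["name"] for m in metrics if tag in m["tags"]]


# Hypothetical metric catalog.
metrics = [
    {"name": "checkout.completed", "tags": {"env:prod", "service:web-store"}},
    {"name": "checkout.completed.staging", "tags": {"env:staging", "service:web-store"}},
    {"name": "mongo.request.latency", "tags": {"env:prod", "service:web-store-mongo"}},
]
prod_metrics = scope_by_tag(metrics, "env:prod")
```

Only the metrics in `prod_metrics` would then be passed on to the correlation search.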

Get correlating!

