Watchdog for Infrastructure Metrics | Datadog


Published: July 17, 2019

Celene: Even the best of us encounter downtime.

The minutes spent discovering and resolving an issue can be critical to you and your users.

At Datadog, one of our goals is to help you find your issues fast, including those lurking beneath the surface.

We’re developing ways to use machine learning to find problems in your services and infrastructure and to help you understand them.

Last year at Dash, we announced Watchdog for our APM customers.

What is Watchdog for Infrastructure Metrics?

Watchdog finds anomalies in your services and tells you about them in an easy-to-digest story format.

The best part: there’s no configuration necessary. It just works.

It works by taking key service metrics, analyzing the last several weeks of data, and determining the bounds of normal.

Then it looks at the latest data and decides whether it is anomalous, based on those bounds.
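The talk does not describe the model Watchdog actually uses, so the following is only a minimal sketch of the general idea of "learn the bounds of normal from history, then test the latest points against them". It assumes a simple mean plus or minus k standard deviations rule; the function names, the threshold `k`, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def normal_bounds(history, k=3.0):
    """Derive (low, high) bounds from historical values.

    Assumption for illustration: "normal" is mean +/- k sample standard
    deviations. This is not Watchdog's actual model.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(history, latest, k=3.0):
    """Return (index, value) pairs from `latest` that fall outside the bounds."""
    low, high = normal_bounds(history, k)
    return [(i, v) for i, v in enumerate(latest) if v < low or v > high]


# Hypothetical data: stand-ins for weeks of history and the newest points.
history = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49]
latest = [50, 52, 75, 51]

print(find_anomalies(history, latest))
```

A production detector would also have to account for trend and daily or weekly seasonality, which a fixed band like this one ignores.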

This has proven so effective in providing useful, actionable information for our APM customers that we decided to broaden its scope and availability.

So today, I am very excited to bring you infrastructure stories in Watchdog, which will make smart issue detection available to all of our customers.

Watchdog will now find your memory leaks, jumps in TCP retransmits, and problems with integrations, such as Redis, Postgres, NGINX, ELB, and S3.

For example, perhaps there is a slow, undetected memory leak in one of your application’s hosts, trending towards a tipping point where it will wreak havoc, or there is an unexpected spike in latency in your Redis instance. Watchdog will find these and tell you about them, potentially before they turn into a bigger problem.

These infrastructure story types are now available in beta, and there are more under development, so stay tuned.

In addition to helping you find issues, the data science team considered how we could help you understand them.

Correlations

So today, I am also excited to bring you a new tool called Correlations.

Let’s say you notice a jump in errors or a spike in latency and you want to investigate. You can then use Correlations to scan thousands of metrics across APM, dashboards, integrations, and even custom metrics to find related behaviors and get closer to uncovering the root cause.
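As a rough illustration of what "scanning metrics for related behaviors" can mean, the sketch below ranks candidate time series by the absolute Pearson correlation with the series under investigation. This is an assumed technique chosen for illustration, not a description of how Correlations is implemented; the metric names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_correlated(target, candidates, top=3):
    """Return the `top` (name, r) pairs, strongest absolute correlation first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top]


# Hypothetical series sampled at the same timestamps.
latency = [10, 11, 10, 12, 40, 42, 41]
candidates = {
    "queue.messages": [100, 105, 98, 110, 400, 420, 410],
    "cpu.idle": [90, 91, 89, 90, 90, 91, 89],
    "disk.free": [500, 500, 500, 500, 500, 500, 500],
}

for name, r in rank_correlated(latency, candidates):
    print(f"{name}: r={r:.2f}")
```

A high score only marks a metric as a candidate worth inspecting; it does not by itself establish which behavior caused the other.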

Correlations is available from Watchdog, as well as dashboards, notebooks, and monitors with just the click of a button.

And you can use facets to quickly home in on the scopes that matter most to you.

Enough talk. Let’s see it in action.

I am pleased to welcome Vishesh Sharma to the stage from Capital One, who will demonstrate how his team has benefited from Datadog’s latest smart features.

How Capital One used Watchdog for Infrastructure Metrics

Vishesh: Hello, everyone.

I’m Vishesh Sharma.

I’m part of the enterprise SRE team at Capital One, where one of my objectives is to integrate tools like Datadog into the day-to-day workflows of Capital One’s engineering teams.

Recently, I created a dashboard to help incident responders triage symptoms of a problem.

Here, in this dashboard, you can see that a Lambda function has started to become really slow.

Let’s run correlations on this issue.

Correlations identified a similar behavior in AWS SQS and AWS Lambda.

So let’s dive into the correlated SQS metrics.

So here we can see that the total number of messages in SQS has also spiked around the same time.

And from the tags, we can see that the messages are coming from the ASV network flow automation application.

So this gives us a great starting point to continue our investigation.

Lambdas often talk to each other.

Let’s see what correlations we found in this lambda packet.

So here we can see that the number of invocations is also heavily correlated, and most of them are coming from a payment management application.

So this gives us another strong candidate to work on.

So to summarize, the Correlations feature is a great tool for troubleshooting issues.

We can quickly cast a wide net and find candidates for the root cause analysis.

Thank you.