Managing alerts at scale

Author: Stephen Boak

Published: March 14, 2017

Hi, I’m Steve, I work on the alerting product at Datadog, and I want to show you a couple things that we’re working on to help manage alerts at scale.

The first is a new Alert Management UI that we’re putting into the product.

And the second is a new type of alert called Composite Alerts.

And I’ll get into how each of these works.

Alert Management UI

So starting with Alert Management.

This has been a pretty good case study in how we do betas at Datadog and Alexis touched on this a little bit in his talk.

We’ve spent the last couple of months talking to some of our large alert users.

Actually, a lot of you are in the room right now, so thank you very much for your feedback over the last few months.

Maybe it’s because we’re the team that wakes people up at night, but people have been very enthusiastic and quick in their feedback about how we can improve the alerting product.

And one of the things we gathered from these interviews over the last couple of months is a set of common trends across our customers.

So our large users have anywhere from hundreds to tens of thousands of monitors in their environments.

They have tens to thousands, in some cases, of users that are sharing Datadog, and those can be across both multiple teams and multiple environments.

So as our customers grow and as our alerting product grows, we need to scale up.

So the goals for this project from the outset were to really improve the experience for this set of users.

And first and foremost, you know, speed being a feature, we have seen page load times drop from sometimes several seconds to 250 milliseconds on the new page.

So we’re really excited about the speed improvements.

And along with the new faceted search that we’re rolling out, we have a lot of increased visibility into coverage across multiple teams and environments.

We’re also introducing multi-edit to make things easier, to tag and to make modifications across all these different monitors.
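To make the idea of multi-edit concrete, here’s a minimal Python sketch of bulk-tagging a list of monitors with a service tag, which is the kind of change multi-edit lets you make in one pass. The monitor IDs, names, and the `add_service_tag` helper are all made up for this example:

```python
def add_service_tag(monitors, service):
    """Add a service:<name> tag to each monitor dict, skipping duplicates."""
    tag = "service:" + service
    for monitor in monitors:
        if tag not in monitor["tags"]:
            monitor["tags"].append(tag)
    return monitors

# Two hypothetical monitors, one already tagged by a team.
monitors = [
    {"id": 101, "name": "Kafka consumer lag", "tags": ["team:data"]},
    {"id": 102, "name": "Kafka broker CPU", "tags": []},
]
add_service_tag(monitors, "kafka")
```

Running the helper twice is safe: the duplicate check means a monitor only ever carries the tag once.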

And all this, at the end of the day, is about giving you more complete coverage and better visibility because that’s what an alerting product is all about.

So since many of you are customers you kind of know how this works today, but I wanted to highlight a few of the things about the current page.

So if I search for something like Kafka, in this particular case, it returns about 50 results. And this is because that search is running across all fields that Datadog stores.

This can sometimes lead to some noise and undesirable results returned from search.

So with the new Manage page we’re introducing proper faceted search so these queries are more precise.

The same query for something like Kafka is just gonna be searching across names and descriptions, but we’re now rolling out facets for things like status, the type of monitor, the service tag, the scope, the metric, and the user that created the monitor, and the notifications.

So you can search across all those things. You can introduce advanced queries into these searches and all of this is URL-addressable.
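To give a flavor of what a URL-addressable faceted search could look like, here’s a small hypothetical Python sketch. The base URL, the facet names, and the `q` parameter syntax are assumptions for illustration, not the documented API:

```python
from urllib.parse import urlencode

def manage_url(text, **facets):
    """Build a shareable Manage-page URL from free text plus facet filters."""
    query = " ".join([text] + ["%s:%s" % (k, v) for k, v in sorted(facets.items())])
    return "https://app.datadoghq.com/monitors/manage?" + urlencode({"q": query})

# A Kafka search narrowed by the status and service facets.
url = manage_url("kafka", status="Alert", service="checkout")
```

Because the whole query lives in the URL, a search like this can be bookmarked or pasted into a team channel.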

And while I’m here, I wanna talk in particular about one facet, which is service tags.

This is something that relates to our APM product.

One of the things that we’re trying to do with the APM product is carry over service tags that are defined as part of the APM so that you can easily identify monitors associated with those services.

This has actually been available for monitors for a while now, but I wanted to put a special highlight on it, especially because with multi-edit it’s now easy to tag all of your monitors with service names in Datadog. It’s something we think will help you organize these long lists across all your different teams and services.

And the private beta is starting today.

As I said, we’ve been collecting feedback from about a dozen users over the last few months, but I would love to get more of you involved.

I was looking through the list of our large users over the last few days and

I know there are about 10 of you in the room, so if you don’t find me I’ll find you.

And I would love to start gathering more feedback about how we can improve this.

Composite monitors

The second thing I want to talk about is composite monitors. So this is also something that’s kinda been in the product for a while as an API-only feature, but we’re now rolling out a UI for it.

And, in short, composite monitors let you take your existing monitors and combine them with new notification rules to hopefully reduce alert floods and give you more meaningful and useful notifications.

And, I think, the best way to talk about them is to just show a couple of these cases.

Reducing noise with composite monitors

So let’s say I have two monitors, monitor A and monitor B. They target the same host.

One is a monitor on CPU utilization, and one is a monitor on latency.

CPU, of course, by itself is not a problem. In fact, high CPU is generally a good thing—you want your resources to be fully utilized. But if a machine’s resources spike and it starts responding to queries more slowly, that can be a problem.

So in our current monitoring system, if these trigger, they both start alerting separately…a classic alert flood case.

But with composite monitors, I can combine these two with a rule like “A and B” and collapse that into one notification.

So if one or both of these metrics goes down, I can get one notification instead of two or however many.

These rules are completely customizable, so I can eliminate notifications altogether unless both of these metrics have spiked.

So hopefully we can reduce noise as well.

And I have complete control over the rules, so with a query almost exactly like this I can say “A and B,” or “A or B.”

I can combine and, kind of, string these rules together however I want.
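To make the rule semantics concrete, here’s a toy Python sketch of how a rule like “A and B” collapses two monitor states into a single fire/no-fire decision. The monitor names and states are illustrative; in Datadog, real composite queries reference monitors by ID:

```python
def evaluate(rule, states):
    """Evaluate a composite rule like 'A and B' against monitor states."""
    # Resolve each monitor name to its alerting state (True/False);
    # builtins are disabled so only the boolean expression runs.
    return eval(rule, {"__builtins__": {}}, states)

states = {"A": True, "B": False}   # A (CPU) is alerting, B (latency) is not
evaluate("A and B", states)  # False: no notification goes out
evaluate("A or B", states)   # True: one combined notification
```

The same states can produce no page or one page depending on the rule, which is exactly how a composite reduces an alert flood to a single, meaningful notification.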

Composite monitors in microservice environments

The second use case is something like a microservice environment.

So let’s say I have this string of dependent services during my e-commerce checkout process, and one of those services goes down.

Again, in the current system, since each of those services depends on the next, they’re all gonna go down and I’m gonna start getting alerts for all three.

That sucks.

So with composite monitors, again, I can kind of string those together via custom rules and get one notification instead of however many.
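As a sketch, stringing the per-service monitors together could look like this. The monitor IDs are invented and the `composite_query` helper is hypothetical, but the shape of the query matches how composite monitors reference other monitors by ID, joined with && or || operators:

```python
def composite_query(monitor_ids, op="||"):
    """Join existing monitor IDs into one composite rule string."""
    return (" %s " % op).join(str(i) for i in monitor_ids)

# One monitor per service in the checkout chain: cart, payment, shipping.
query = composite_query([111, 222, 333])   # "111 || 222 || 333"
```

A single composite built on a query like this fires once when any service in the chain goes down, instead of paging you separately for all three.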

This is what it looks like.

So I go to create a new composite monitor.

I choose my first existing monitor and Datadog’s going to give me a little bit of information about how that monitor is responding right now.

As I add more monitors, and I can add up to 10 of these, Datadog is going to start looking at how they overlap.

So, in this case, they have some common hosts.

And then as I define the rules for the composite, like alert when A and B are both firing, Datadog is going to tell me, in real time as I’m creating that monitor in the UI, how this composite would respond.

And, again, the end result of this hopefully is that you can better manage your alert floods and get fewer notifications.

This also is available today.

And you can reach out to support or me directly and we’ll start taking your names and getting you involved in the beta.

That is all I have.

Any questions?