
Algorithmic Alerting with Datadog


Published: October 26, 2016

Introduction

Great to see all of you here.

As said, I’m Homin, Lead Data Scientist here at Datadog, and I’m gonna be talking about some of our newer algorithmic alerting options.

So first, I’m gonna be introducing anomaly detection, which is a feature that is being released today or tomorrow. And that’s about monitoring a metric through time: we’re gonna look at the metric’s history and use that to see whether or not it’s anomalous.

And then I’ll contrast that with outlier detection, which is monitoring metrics through space. That is, are these several metrics close to each other, or is one of them very far away, an outlier?

And then I’ll talk a little bit about combining the two, so, how do you monitor metrics using both anomaly and outlier detection at the same time?

All right. So as you all know, Datadog is great for monitoring your services and systems, and we make it painless to set up alerts, especially for things like resource metrics, like disk space that can be exhausted, or metrics you just understand really well, like this one where we know that if it goes to 1.5K, things go downhill and an alert should go off.

But there are plenty of cases where these threshold alerts aren’t sufficient, right?

So say you have a trending metric, this is a metric that always goes down.

You’re gonna have to reset a threshold alert constantly if you wanna catch these downward spikes, because for it to catch that downward spike it has to be set at a certain level, and then the metric keeps trending downward, so you’ll have to reset it, reset it, reset it.

And so you’ll need to reset it all the time or just get lots of false alarms.

There’s also seasonal metrics.

So this metric spikes in the middle of every weekday.

Probably, it has something to do with some sort of user behavior.

And if you wanna alert on the spike that happens in the middle of the night, a threshold alert is not gonna be able to catch that.

So we have something called change alerts, and some people use change alerts to try to capture seasonal behavior.

So say you wanted to catch this spike that happens in the middle of Thursday. You would set a change alert that asks how different the metric is at that time of day from 24 hours before.

And what the change alert monitor actually evaluates is that difference, so it’ll show up as that big spike in the middle of Thursday.

But then you’ll actually also get alerted like on Friday because that big spike wasn’t there anymore.

So you get like an extra alert.

And so while change alerts are great, seasonal behavior isn’t really their best use case.
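
To make that concrete, here’s a minimal pandas sketch of what a change alert effectively evaluates (value now minus value 24 hours ago); the series and numbers are made up, not Datadog internals. The spike shows up once as a large positive change on Thursday and again as a large negative change on Friday, which is where the extra alert comes from.

```python
import numpy as np
import pandas as pd

# Hypothetical seasonal metric: one week at 5-minute resolution with a
# mid-Thursday spike (names and numbers here are purely illustrative).
idx = pd.date_range("2016-10-17", periods=7 * 24 * 12, freq="5min")
series = pd.Series(100 + 10 * np.sin(np.arange(len(idx)) * 2 * np.pi / (24 * 12)), index=idx)
series.loc["2016-10-20 12:00":"2016-10-20 13:00"] += 80  # the Thursday spike

# A change alert effectively evaluates "value now minus value 24 hours ago".
change = series - series.shift(24 * 12)

# The spike shows up as a big positive change on Thursday...
print(change.loc["2016-10-20 12:00":"2016-10-20 13:00"].max())
# ...and again as a big *negative* change on Friday, when the comparison
# point is the spike itself -- which is the extra, unwanted alert.
print(change.loc["2016-10-21 12:00":"2016-10-21 13:00"].min())
```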

Anomaly detection

And then of course, all of these problems are just compounded if you have metrics that are both trending and seasonal, which is the case for most business type metrics.

To be able to alert on seasonal metrics and trending metrics, we have anomaly detection.

All right, so anomaly detection is gonna look at the past behavior of your metric and it’s gonna alert when it deviates from what we think is normal behavior, and it does this in real time.

So the way it’s gonna work is that we’re gonna predict the range of values that we think is normal and we’re gonna represent that by this grey band. Anytime the metric leaves that grey band, we’ll color it red and call that an anomaly.

So it works really well for TimeBoards and ScreenBoards.

You stick it on a graph and it’ll be immediately obvious whether your metric is being anomalous or not, and it’s just another function, so it’s as simple as adding it through the graph editor.

Setting monitors

And then of course, you can set up monitors.

So if you receive an alert like this, it’s clear we expected this metric to be around five million, and now it’s around six million. You set the alert so that if the metric is anomalous for a certain percentage of the time window, you get this alert.

And this nice little snapshot shows you that it is, in fact, outside of the expected range of values.
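
As a rough sketch of that trigger logic (assuming the band has already been computed; the threshold and window length below are illustrative, not Datadog defaults):

```python
import numpy as np

def is_alerting(values, lower, upper, threshold=0.2):
    """Alert if more than `threshold` of the points in the evaluation
    window fall outside the predicted [lower, upper] band."""
    values, lower, upper = map(np.asarray, (values, lower, upper))
    outside = (values < lower) | (values > upper)
    return outside.mean() > threshold

# The band expects roughly five million; the metric is sitting near six.
window = np.full(60, 6.0e6)
print(is_alerting(window, lower=np.full(60, 4.5e6), upper=np.full(60, 5.5e6)))  # True
```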

Once you get that alert, what are you gonna do?

Well, you can click through and go to the monitor status page.

And so we show you that corresponding snapshot that you got, but we also give you some historical context, right?

This alone doesn’t really tell you why the band is at five million, or why the metric being at six million is anomalous.

But if you click through to the historical context part, you can see this metric has been pretty much steady where it is, so the fact that it’s up there at six million really is anomalous.

Basic anomaly detection

So how do we use that past history to tell you what we think is normal?

So the simplest algorithm we have is called Basic, and it pretty much does the obvious thing: it’ll just look at the immediate past, calculate the normal range of values from what it sees there, and then just draw the bands like that.

And instead of using the mean and standard deviation, we use quantiles because they’re more robust, and it’s great for metrics that are steady or metrics whose levels change very slowly.

So Basic is somewhat robust: for instance, when it sees a spike like that, it’ll ignore it and continue drawing the envelope without being affected by it, which is nice.

But in this particular case, these spikes are actually regularly occurring spikes that happen every half hour.

So you might wanna go to one of our other algorithms.

And so pretty much every other algorithm we have will be able to capture this regular behavior and know that it’s normal.
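
Before moving on, here’s a minimal sketch of the rolling-quantile idea behind Basic. Datadog hasn’t published the exact formula, so the window size, quantiles, and tolerance below are illustrative; the point is just that quantile-based bands shrug off a lone spike where mean-and-standard-deviation bands would not.

```python
import numpy as np
import pandas as pd

def basic_band(series, window=60, tolerance=1.0):
    """Band built from rolling quantiles of the recent past, widened by a
    tolerance factor. Quantiles barely move when a single spike enters the
    window, which is what makes this band 'robust'."""
    lo = series.rolling(window, min_periods=window // 2).quantile(0.25)
    hi = series.rolling(window, min_periods=window // 2).quantile(0.75)
    width = tolerance * (hi - lo)
    return lo - width, hi + width

# Steady metric with one short spike: the spike is flagged, the band barely moves.
rng = np.random.default_rng(0)
series = pd.Series(10 + rng.normal(0, 0.5, 500))
series.iloc[250:255] += 20
lower, upper = basic_band(series)
print(((series < lower) | (series > upper)).iloc[250:255].all())  # True
```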

Robust anomaly detection

So our first non-Basic algorithm is something called Robust, and we just take a page from classical timeseries analysis.

And what we do is we take the original metric, we decompose it into a trend component and a seasonal component, which leaves us with some noise, and that gives us the model for which we consider to be normal behavior.
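
The internals of the Robust algorithm aren’t public, but the trend-plus-seasonal decomposition being described is the classical one. Here’s a sketch of that idea using the robust STL decomposition from statsmodels, with the residual spread standing in for the width of the band; the series and the 3x multiplier are made up.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Illustrative hourly metric: slow upward trend + daily seasonality + noise.
rng = np.random.default_rng(1)
idx = pd.date_range("2016-10-01", periods=24 * 28, freq="h")
series = pd.Series(
    50
    + 0.05 * np.arange(len(idx))                          # trend component
    + 10 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24)   # seasonal component
    + rng.normal(0, 1, len(idx)),                         # noise
    index=idx,
)

# Robust decomposition: outlying points are downweighted, so a short-lived
# anomaly barely moves the fitted trend and seasonal components.
result = STL(series, period=24, robust=True).fit()
expected = result.trend + result.seasonal
spread = result.resid.std()
lower, upper = expected - 3 * spread, expected + 3 * spread
print(((series < lower) | (series > upper)).mean())  # fraction of points flagged
```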

And the nice thing about the Robust algorithm is that it lives up to its name and it’s very robust.

So by that, I mean, here’s an anomaly that lasts like half an hour, and notice how the band just doesn’t budge, right?

It’s just like, “I know normal is down here. And even though there was this anomalous behavior that lasted for half an hour, I’m gonna continue seeing that normal behavior is down here.”

And this is for better or worse, because say you made some change to your code and there’s a level shift in your metric that you actually wanted: now normal behavior really is 20.

So the flip side to Robust is that even though it was an intended level shift, it’ll keep saying that normal behavior is down here for a long time and keep calling the metric anomalous, maybe for longer than you want.

And it’s nice to contrast this with Basic: Robust lives up to its name and is robust to that half-hour anomaly, whereas Basic, after a while, starts to incorporate the anomaly within the bounds, and you can see that in how the band gets fatter after the anomaly happens.

Agile anomaly detection

Our next algorithm is called Agile.

It’s a robust version of the classical SARIMA model for timeseries.

And so the main idea is that we’re trying to predict the next point.

You probably wanna use the points that came right before it to make your prediction.

So this is the same idea as Basic.

But the twist is that you’re also gonna look at the same time of day the day before, or a week before, or several weeks before, and let that inform what you think the next point should be, so you’re able to capture this seasonal behavior as well as the different trends.

And so this makes the predicted range sensitive to immediate trends, so you see it takes a slight dip there.

And even if that’s not part of a larger trend, it’ll pick it up, and it’ll also pick up the longer-term trends.
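
Agile’s exact formulation isn’t public beyond being a robust SARIMA variant, but a plain SARIMA fit shows the basic idea: the non-seasonal terms lean on the points that came right before, and the seasonal terms lean on the same time in previous cycles. The toy series and model orders below are illustrative, not what Datadog uses.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative series: 14 "days" of 24 hourly points with a daily cycle.
rng = np.random.default_rng(2)
n = 14 * 24
series = 10 * np.sin(np.arange(n) * 2 * np.pi / 24) + rng.normal(0, 1, n)

# order=(1, 1, 1) leans on the immediately preceding points (like Basic);
# seasonal_order=(1, 1, 1, 24) also leans on the same hour of previous days.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
fit = model.fit(disp=False)

# One-step-ahead forecast with a confidence interval -- the analogue of the band.
forecast = fit.get_forecast(steps=1)
print(forecast.predicted_mean, forecast.conf_int(alpha=0.05))
```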

Adaptive anomaly detection

And then the last algorithm we provide is called Adaptive. It’s a blend of several different predictions, and we combine them using an online learning algorithm.

And it’s useful if you’re trying to use anomaly detection on a series whose behavior changes over time.

And another advantage of Adaptive is that unlike the Agile or Robust algorithms, it requires less history because it’s adaptive and can kind of work with what it has.
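
Datadog hasn’t said which online learning algorithm Adaptive uses to blend its predictions, so treat this as a generic sketch of the idea: keep a weight per predictor, shrink the weights of predictors with large recent errors, and output the weighted blend. The multiplicative-weights update and learning rate here are one common choice, not necessarily Datadog’s.

```python
import numpy as np

def blend_predictions(observations, predictors, eta=0.5):
    """Blend several per-step predictions with multiplicative weights:
    predictors with smaller recent errors accumulate larger weights, so the
    blend adapts as the behavior of the series changes."""
    weights = np.ones(len(predictors)) / len(predictors)
    blended = []
    for t, y in enumerate(observations):
        preds = np.array([p[t] for p in predictors])
        blended.append(float(weights @ preds))
        weights *= np.exp(-eta * (preds - y) ** 2)  # penalize squared error
        weights /= weights.sum()
    return blended

# Two toy predictors: one matches the series early on, the other later.
y = np.concatenate([np.zeros(50), np.ones(50)])
predictors = [np.zeros(100), np.ones(100)]
blend = blend_predictions(y, predictors)
print(round(blend[10], 3), round(blend[-1], 3))  # tracks 0 early, 1 late
```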

So all the algorithms have a single parameter.

It’s just the tolerance and it just controls how fat the band is.

You should just set it according to how much deviation you feel comfortable with.

How we define anomalies

So we chose to define anomalies this way: we draw a band, and anything outside of the band is an anomaly.

And the reason we did this is that it’s really clear, right, like visually, you know what’s an anomaly.

It’s anything outside of the band.

And on top of that, you also know like what the algorithm thinks of as being normal behavior because that band says if the metric is between this value and that value, it’s probably like fine.

There is a consequence to this, which is that this definition of anomaly as something being outside of a band plays very strangely with time aggregation.

So here, we see like a week and we see this red spike and like…let’s zoom in.

And so we see that spike, but we also see this like tiny spike right next to it that wasn’t in that picture before.

And I think most people have the intuition that, “Oh, that’s okay,” right?

When we were zoomed out before, each time point was an average or a sum of the points around it, so that little spike got washed out, and people are okay with this.

Now, here’s a less intuitive example.

So here, we see a metric that’s around 8, the bands are drawn around 9 and 7, and then there’s a spike that goes to 10.

Now, let’s zoom in.

And now, the anomaly disappears.

So what happened?

See, the metric is still centered around 8, but now there’s a lot more variance, so the metric swings from 3 to 13 all the time.

And then, this might be harder to see, but if you look at that section there, the variance actually shrinks a little bit and the mean goes to like around 10.

And when you aggregate that up, that’s what produces the spike over there.

And so if you’re trying to use anomaly detection on a monitor with a short time window and you don’t care about spikes like that, you’re in luck.

If you do care about spikes like that, you have some options: you can smooth the function, which reduces the variance, tightens the bands, and lets them catch spikes like that; or you can use a time roll-up so that the monitor looks at the same resolution of points that you see in the zoomed-out version.

But this is just a consequence of how time aggregation interacts with using bands to define anomalies.
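
To make that interaction concrete, here’s a toy pandas sketch of the situation just described: a noisy per-minute metric centered at 8 that briefly shifts to 10. At full resolution the natural spread is roughly 3 to 13, so 10 is unremarkable; rolled up to 30-minute averages, the noise cancels and the same interval stands out. All numbers are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2016-10-26", periods=24 * 60, freq="1min")   # one day, per minute
series = pd.Series(8 + rng.normal(0, 2, len(idx)), index=idx)     # noisy, centered at 8
series.loc["2016-10-26 12:00":"2016-10-26 12:29"] = 10            # brief shift to 10

# Zoomed in (full resolution): the series routinely swings from about 3 to 13,
# so a band wide enough for that noise easily contains a value of 10.
print(series.quantile([0.01, 0.99]).round(1).tolist())

# Zoomed out (30-minute roll-up): averaging washes out the noise, the spread
# tightens around 8, and the 12:00 bucket now sticks out near 10.
rollup = series.resample("30min").mean()
print(round(rollup.loc[pd.Timestamp("2016-10-26 12:00")], 1),
      rollup.quantile([0.01, 0.99]).round(1).tolist())
```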

Anomaly v. outlier detection

So the anomaly detection we’ve seen so far is temporal in nature: we’re comparing a single metric’s values with its past history.

So now, I’m gonna contrast this with outlier detection, which we announced last year, and which considers the space of metrics.

So intuitively, metrics are close to each other if they have like similar values at every time point.

And outlier detection can tell you when one of them has deviated from the rest.

This particular example shows outlier detection running across a group of hosts. Each line is a host for a particular app, and you’ll notice outlier detection picked out one that seemed to be doing less work than the others.

And it turns out it was just running a slightly older system version; upgrading it solved the problem, and they all started doing the work that they should have been doing.

So we have two different algorithms to use.

One is called DBSCAN, named after a clustering algorithm.

And the way it works is you have a bunch of timeseries.

We’re gonna calculate a new median timeseries where, for each tick, you take the median value across the series. Then we calculate the distance between each series and the median series, and the median of those distances is what we consider to be close.

And then anything that’s close to each other is like the main group, anything that’s far is far.

But let’s go down to two dimensions, which is easier to picture.

Say each of these five points is a timeseries.

So the new median series is like this purple point here.

So notice that it’s lined up with the third point on the x-axis and the third point on the y-axis.

And then if you look at all the distances between the gray points and the purple point, the median distance is that span right there, so that’s what we’re gonna consider close.

And then we’re just gonna draw balls of that radius around all the points, and we’re gonna cluster together anything that’s touching.

And then the biggest one is gonna be what we consider to be normal behavior and then the outlier is anything that’s not within that group.

And so this picture is the same thing but applied to like 60 dimensions as opposed to 2.
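
Here’s a sketch of that procedure as described, written out in plain NumPy rather than Datadog’s implementation: build the pointwise-median series, scale the median distance to it by a tolerance to get the ball radius, single-link cluster the series whose balls touch, and call everything outside the largest cluster an outlier. The tolerance value is illustrative.

```python
import numpy as np

def dbscan_style_outliers(series_matrix, tolerance=3.0):
    """series_matrix has shape (n_series, n_points); returns outlier indices.

    1. Build the pointwise-median series.
    2. Take the median distance from each series to that median series,
       scaled by a tolerance, as the "close" radius.
    3. Cluster series whose balls of that radius touch; the largest cluster
       is normal, everything else is an outlier.
    """
    X = np.asarray(series_matrix, dtype=float)
    median_series = np.median(X, axis=0)
    radius = tolerance * np.median(np.linalg.norm(X - median_series, axis=1))

    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    touching = pairwise <= 2 * radius            # balls of equal radius touch

    # Single-link clustering by label propagation until stable.
    labels = np.arange(len(X))
    changed = True
    while changed:
        changed = False
        for i, j in zip(*np.nonzero(touching)):
            if labels[i] != labels[j]:
                labels[labels == max(labels[i], labels[j])] = min(labels[i], labels[j])
                changed = True

    largest = np.bincount(labels).argmax()
    return np.flatnonzero(labels != largest)

# Five "hosts" reporting similar values, one doing visibly less work.
rng = np.random.default_rng(4)
hosts = rng.normal(1.0, 0.05, size=(5, 120))
hosts[3] -= 0.5                                  # the under-working host
print(dbscan_style_outliers(hosts))              # expected: [3]
```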

The other algorithm is called MAD, which stands for the median absolute deviation (from the median), and it takes a slightly different approach.

We find like the median value of all the points, and that’s the thick red line.

Then we calculate the MAD, which is like a robust version of the standard deviation, and we add and subtract that from the median, and those are the dotted lines.

And then if a certain percentage of a metric’s points are outside of those bands, we’ll call that metric an outlier.
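
A sketch of that version, with an illustrative tolerance and percentage (Datadog’s defaults may differ):

```python
import numpy as np

def mad_outliers(series_matrix, tolerance=3.0, pct=0.2):
    """Flag series whose points stray too often from the global median.

    The MAD (median absolute deviation from the median) plays the role of a
    robust standard deviation; the band is median +/- tolerance * MAD.
    """
    X = np.asarray(series_matrix, dtype=float)
    median = np.median(X)                         # the thick line
    mad = np.median(np.abs(X - median))           # robust spread
    lower, upper = median - tolerance * mad, median + tolerance * mad  # dotted lines
    outside = (X < lower) | (X > upper)
    return np.flatnonzero(outside.mean(axis=1) > pct)

rng = np.random.default_rng(5)
hosts = rng.normal(1.0, 0.05, size=(5, 120))
hosts[2] += 0.6                                   # one host consistently high
print(mad_outliers(hosts))                        # expected: [2]
```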

Combining anomaly and outlier detection

All right, so now that we have both anomaly detection and outlier detection at our disposal, like how do we combine them?

So anomaly detection is like definitely not something you should be applying to all your metrics, right?

It’s most useful for metrics that have like some sort of trend or seasonality that you wanna capture.

And outlier detection, you also don’t wanna apply to every kind of metric.

It should be for groups of metrics that should be behaving similarly.

And so the prototypical use case is: say you have some metric for an application that has this nice seasonal behavior; use anomaly detection on that, in aggregate.

And then if you’re worried that a particular host might be going bad, you also apply outlier detection to the group of hosts.

And so by monitoring both the historical time aspect, in aggregate, and the space aspect across all the different hosts, you get the best of both worlds and you’re able to monitor them together.
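
As a toy end-to-end illustration of that setup (not Datadog monitor syntax, and with made-up data): check the fleet-wide aggregate against its own seasonal history, and at the same time compare hosts against each other.

```python
import numpy as np

# Toy fleet: ten hosts sharing a daily pattern; one degrades on the last day.
rng = np.random.default_rng(6)
t = np.arange(24 * 7)                                      # one week, hourly
pattern = 100 + 30 * np.sin(t * 2 * np.pi / 24)
hosts = pattern + rng.normal(0, 3, size=(10, len(t)))
hosts[7, -24:] *= 0.5                                      # host 7 does half the work

# 1) Temporal check on the aggregate: compare the latest value with a crude
#    band built from the same hour on the six previous days.
aggregate = hosts.sum(axis=0)
same_hour_history = aggregate[-1 - 24 * np.arange(1, 7)]
lo, hi = np.quantile(same_hour_history, [0.05, 0.95])
print("aggregate anomalous:", not lo <= aggregate[-1] <= hi)

# 2) Spatial check across hosts: compare each host's average over the last
#    day with the others, using a MAD-style rule.
last_day = hosts[:, -24:].mean(axis=1)
median = np.median(last_day)
mad = np.median(np.abs(last_day - median))
print("outlier hosts:", np.flatnonzero(np.abs(last_day - median) > 5 * mad))
```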

Conclusion and Q&A

So, thanks. I hope you find anomaly and outlier detection to be useful for your monitoring needs.

And I’m gonna be around all afternoon so I’m happy to talk about any particular use cases you might have.

And if you have questions, I’m happy to take them now.

Question:

Do any of these take into account weekends versus weekdays? Because we have a lot of stuff where weekday traffic can tend to move around but stays very similar, but then you go from Friday to Saturday and from Sunday to Monday.

Answer:

Yeah, so the question was if we deal with weekends, and, yes, yes, we do.

So like the kinds of seasonalities we’re looking at are like daily, right, so how does one day compare to the next, and then also weekly, like how does one week compare to the next.

And right now, we look at about six weeks of past history and use that to figure out what’s normal at that moment in time.

Yeah, and this is a good example: those are the five weekdays and that’s the weekend, and we know that it’s flatter on the weekend.

Question:

Is it possible to leverage the backend procedure function?

Because when we have metrics, some of the metrics have some sort of known constant background noise, and we want to alert based on the real metric minus that constant background noise.

Is it possible to leverage this anomaly detection, too?

Because we could, in principle, subtract some sort of constant background metric and then alert based on that, but there’s also a concern that we don’t know the behavior of this background noise.

And if that constant background noise trends in some sort of positive or negative way, we’d constantly have to readjust it.

Can you address whether the anomaly detection might help?

Answer:

Sure. If the noise really is constant, anomaly detection should have no problem picking it up.

And if the distribution of that constant noise is slowly changing over time, then that should be fine.

Host:

So with that, thank you very much, Homin.