Introducing recovery thresholds for metric monitors

Rachel Windzberg

When a metric's value is unstable, there’s a risk that your alerts will keep switching on and off, which may create noise and distraction.

Consider a case where you've set an alert to notify you whenever your p95 response time crosses a pre-defined threshold for acceptable latency. You know that if your latency spikes, you’ll find out right away. But if the p95 metric hovers right around the alert threshold, the slight fluctuations above and below will generate a slew of notifications.

That’s why we’ve introduced recovery thresholds in Datadog. Recovery thresholds stop flapping monitors from getting in the way of observability, and they increase your confidence that an issue has truly been resolved when an alert recovers.

How recovery thresholds work

The principle behind recovery thresholds is hysteresis—the dependence of a state of a system on its history—which is also how a thermostat regulates the temperature of your home. Your thermostat switches on and off at different temperatures—in other words, it has different thresholds depending on whether the temperature is rising or falling. Without this mechanism, the thermostat would switch on and off every few seconds, which would be inefficient.

Recovery thresholds work the same way, setting a different value for when an alert triggers and when it resolves. If you’ve set a recovery threshold, an alert only enters the “recovered” state once a metric has passed it. But a metric crossing the recovery threshold without first reaching the alert threshold will have no effect.

Setting recovery thresholds

When creating a monitor via the UI, add the recovery threshold when you set your alert conditions. You can set up thresholds for recovery from both alert and warning states. Recovery thresholds apply to threshold alerts, change alerts, and anomaly detection.

If you’re using the the API, you can add a recovery threshold within the thresholds dictionary in the options argument:

1
options = {
2
  'thresholds': {
3
    'critical': 100,
4
    'critical_recovery': 80,
5
    'warning': 70,
6
    'warning_recovery': 60
7
  }
8
}

Keep in mind that to make the most of recovery thresholds, you should think about the point at which you're comfortable in declaring the alert resolved.

If you're not using Datadog yet, get started with a 14-day free trial.

Introducing recovery thresholds for metric alerts

How recovery thresholds work

Setting recovery thresholds

Related Articles

Route your monitor alerts with Datadog monitor notification rules

Monitor Oracle NetSuite performance with Continuous AI’s offering in the Datadog Marketplace

How to create an effective paging strategy

Quickly get detailed, actionable context for alerts with Datadog's new Monitor Status page

Start monitoring your metrics in minutes

Get Started with Datadog

How recovery thresholds work

Setting recovery thresholds

Related Articles

Route your monitor alerts with Datadog monitor notification rules

Monitor Oracle NetSuite performance with Continuous AI’s offering in the Datadog Marketplace

How to create an effective paging strategy

Quickly get detailed, actionable context for alerts with Datadog's new Monitor Status page

Related jobs at Datadog

We're always looking for talented people to collaborate with

Start monitoring your metrics in minutes