Introducing recovery thresholds for metric alerts

Published: November 6, 2017

When a metric’s value is unstable, there’s a risk that your alerts will keep switching on and off, which may create noise and distraction.

Consider a case where you’ve set an alert to notify you whenever your p95 response time crosses a pre-defined threshold for acceptable latency. You know that if your latency spikes, you’ll find out right away. But if the p95 metric hovers right around the alert threshold, the slight fluctuations above and below will generate a slew of notifications.

That’s why we’ve introduced recovery thresholds in Datadog. Recovery thresholds stop flapping monitors from getting in the way of observability, and they increase your confidence that an issue has truly been resolved when an alert recovers.

How recovery thresholds work

The principle behind recovery thresholds is hysteresis—the dependence of a state of a system on its history—which is also how a thermostat regulates the temperature of your home. Your thermostat switches on and off at different temperatures—in other words, it has different thresholds depending on whether the temperature is rising or falling. Without this mechanism, the thermostat would switch on and off every few seconds, which would be inefficient.

Recovery thresholds work the same way: you set one value at which an alert triggers and a different value at which it resolves. If you’ve set a recovery threshold, an alert only enters the “recovered” state once the metric has crossed that threshold. But a metric crossing the recovery threshold without first reaching the alert threshold will have no effect.
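To make the behavior concrete, here is a minimal sketch of hysteresis for an "above threshold" monitor. It is an illustration only, not Datadog's implementation: the `alert_states` and `count_transitions` helpers, the sample latency values, and the choice to recover when the value is at or below the recovery threshold are all assumptions made for this example.

```python
def alert_states(values, critical, critical_recovery=None):
    """Return 'ALERT' or 'OK' for each value of an 'above threshold' monitor.

    Hypothetical helper for illustration; not Datadog's evaluation logic.
    """
    # With no recovery threshold, the alert recovers as soon as the value
    # is no longer above the alert threshold.
    if critical_recovery is None:
        critical_recovery = critical

    alerting = False
    states = []
    for value in values:
        if not alerting and value > critical:
            # Only the alert threshold can trigger the alert.
            alerting = True
        elif alerting and value <= critical_recovery:
            # Only the recovery threshold can resolve it (boundary
            # handling here is an assumption).
            alerting = False
        states.append("ALERT" if alerting else "OK")
    return states


def count_transitions(states):
    """Count state changes, each of which would produce a notification."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Made-up p95 latency values hovering around an alert threshold of 100.
latency = [98, 101, 99, 102, 97, 103, 99, 85, 79, 90]

flapping = alert_states(latency, critical=100)
stable = alert_states(latency, critical=100, critical_recovery=80)

print(count_transitions(flapping))  # 6
print(count_transitions(stable))    # 2
```

With the same data, the monitor without a recovery threshold changes state six times, while the one with a recovery threshold of 80 triggers once and recovers once. A series such as `[90, 75, 90]` never changes state in this sketch, because the value dips below the recovery threshold without ever having crossed the alert threshold.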

Setting recovery thresholds

When creating a monitor via the UI, add the recovery threshold when you set your alert conditions. You can set up thresholds for recovery from both alert and warning states. Recovery thresholds apply to threshold alerts, change alerts, and anomaly detection.

If you’re using the API, you can add a recovery threshold within the thresholds dictionary in the options argument:

options = {
  'thresholds': {
    'critical': 100, 
    'critical_recovery': 80, 
    'warning': 70,
    'warning_recovery': 60
  }
}
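As a rough sketch of how that dictionary might fit into a full monitor definition, the snippet below builds a request body and checks that each recovery threshold sits below its alert threshold. The `build_monitor_payload` helper, the metric name `my.app.latency.p95`, the monitor name, and the message are hypothetical, and the ordering check is an assumption about what makes sense for an "above" monitor rather than documented API behavior. Nothing is sent to Datadog here.

```python
import json


def build_monitor_payload(name, query, message, thresholds):
    """Assemble a monitor definition as a plain dict (hypothetical helper)."""
    # Assumed sanity check for an "above" monitor: a recovery threshold
    # should be lower than the threshold it recovers from.
    for level in ("critical", "warning"):
        recovery = thresholds.get(level + "_recovery")
        if recovery is None or level not in thresholds:
            continue
        if recovery >= thresholds[level]:
            raise ValueError(
                "%s_recovery (%s) must be below %s (%s)"
                % (level, recovery, level, thresholds[level])
            )
    return {
        "type": "metric alert",
        "query": query,
        "name": name,
        "message": message,
        "options": {"thresholds": thresholds},
    }


payload = build_monitor_payload(
    name="p95 latency is high",  # hypothetical monitor name
    query="avg(last_5m):avg:my.app.latency.p95{env:prod} > 100",  # hypothetical metric
    message="p95 latency is above the acceptable threshold.",
    thresholds={
        "critical": 100,
        "critical_recovery": 80,
        "warning": 70,
        "warning_recovery": 60,
    },
)

# Inspect the body before handing it to whichever API client you use.
print(json.dumps(payload, indent=2))
```

Validating the thresholds locally is a design choice for this example: it surfaces an inverted pair, such as a recovery value of 120 against an alert value of 100, before any request is made.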

Keep in mind that to make the most of recovery thresholds, you should decide on the point at which you’re comfortable declaring the alert resolved.

If you’re not using Datadog yet, get started with a 14-day free trial.