Alert Fatigue: What It Is and How to Prevent It | Datadog

Alert fatigue: What it is and how to prevent it

Author Candace Shamieh
Author Daljeet Sandu
Author Nicolas Narbais

Published: January 4, 2024

As your environment changes, new trends can quickly make your existing monitoring less accurate. At the same time, building alerts after every new incident can turn a straightforward strategy into a convoluted one. Treating monitoring as a one-time or reactive effort can both result in alert fatigue. Alert fatigue occurs when an excessive number of alerts are generated by monitoring systems or when alerts are irrelevant or unhelpful, leading to a diminished ability to see critical issues. Updating your alerts infrequently or too often can cause false positive alarms and redundant alerts that overwhelm your team. A desensitized team won’t be able to detect issues early and will lose trust in their monitoring systems, which can disrupt production and negatively impact your business.

To prevent and minimize alert fatigue, it is important to continuously review and update your monitoring strategy, specifically targeting unnecessary or unhelpful alerts. For larger organizations with thousands of alerts, implementation is best done at a team level to ensure alerts are catered to each team’s specific responsibilities. In this post, we will discuss how you can adopt a continuous improvement methodology and implement a practical workflow that helps you revisit and review your alerts. We’ll cover how you can:

Identify noisy alerts

The first step in this workflow is to pinpoint your noisiest alerts, or the ones that trigger the most. Two common types of noisy alerts are:

  • Predictable alerts
  • Flappy alerts

Predictable alerts

As you review noisy alerts, it’s important to note that predictable alerts form consistent patterns. An alert should only be triggered when an unexpected event occurs. For example, if you review an alert’s regular patterns over the course of a week and you can predict that the next alert is going to occur on Friday between 5:00 p.m. and 6:00 p.m., then it should be marked as a predictable alert and adjusted accordingly. This ensures that you are only notified of unexpected events.

Flappy alerts

Flappy alerts are those that frequently switch back and forth between health statuses, like from OK to ALERT to OK. These alerts can quickly bombard you and your response teams if you’re being notified every time the state is changed. As an example, let’s say you have an alert that tracks disk space usage on a server and is set to trigger when disk usage exceeds 90 percent for more than five minutes. The alert detects a spike and triggers a notification. Your team acknowledges the alert was triggered by an automated backup process, so they don’t take any immediate action. The alert updates its status to OK again once the backup is complete and the disk usage drops below 90 percent. A few hours later, the alert once again triggers and your team finds that there was a temporary spike in disk usage due to a one-time large file transfer, but it’s not a cause for concern and easily resolved by deleting unnecessary files. This frequent toggling between ALERT and OK states due to temporary spikes in disk usage can cause alert fatigue because it generates alerts that don’t indicate critical issues.

When you review potential flappy alerts, determine how long it takes for a health status to change and whether or not it auto-recovers. If you’re being alerted every time the status has changed for a short period of time—like five minutes—or if the issue auto-recovers anyway, then it needs to be addressed.

Take preventative action to minimize future opportunity for alert fatigue

Once you’ve identified your noisiest alerts, you can adjust them to reduce their alerting frequency and increase their overall value. Adjustments are best handled at a team level to allow each team to act independently. In this section, we will discuss the five primary tasks that achieve this. After implementing these changes, it’s important to revisit alert settings periodically when you review your monitoring needs, whether that be once a quarter, semiannually, or another time frame that best fits your team. These tasks are to:

Increase your evaluation window

Setting an evaluation window for your alerts defines the frequency at which your monitoring tool evaluates the relevant data and compares it to your configured alert conditions. One common misconception is that increasing the evaluation window might lead to slower responses or missed alerts, but that isn’t the case. The alert continuously assesses the underlying data, so lengthening the evaluation window will ensure that the system considers more data points before deciding if an anomaly exists. In other words, you can increase the evaluation window so that you are only alerted if a behavior is happening consistently—not temporarily.

Setting a short evaluation window for alerts might seem like a proactive approach to catch anomalies quickly. However, this can increase the occurrence of false positives. For instance, consider a database that occasionally experiences brief surges in query times due to harmless background tasks. If the evaluation window is too short, even these spikes can trigger alerts, disrupting your team’s flow of work.

Timeseries graph to highlight the significance of an evaluation window.
Setting too short an evaluation window for alerts may lead to false positives during short, intermittent spikes.

Add recovery thresholds to your flappy alerts

Recovery thresholds are optional thresholds that you can add to a flappy alert, particularly those that trigger multiple times in a row during a short period, like over the past 15 minutes. Recovery thresholds include additional conditions to help you verify that an issue is resolved so you don’t receive unnecessarily disruptive alerts. The recovery threshold is satisfied whenever the recovery condition is met.

For example, let’s say you’ve set up an alert that notifies you once a server’s CPU usage exceeds 80 percent and added a recovery threshold that triggers when CPU usage is below 70 percent. When one of your servers experiences a temporary spike in CPU usage that reaches 85 percent, it triggers an alert. You already have auto-scaling enabled, so your server’s workload quickly decreases and CPU usage falls below 70 percent. Your monitor detects the decrease and appropriately switches the alert from ALERT to OK without any manual intervention.

Consolidate alerts with notification grouping

With notification grouping, you can define how alerts should be grouped or choose to group them by a specific dimension, whether that be by service, cluster name, host, device, or something else entirely. If you choose to group alerts by multiple dimensions, then your alert will send you an individual notification for each entity that meets the alert threshold.

Alerts with multiple dimensions can quickly become overwhelming if you’re monitoring a large group of entities. For example, let’s say you’re monitoring for latency across a group of critical services, each running on multiple hosts. You set up one alert with two dimensions: host and service. With this setup, you can potentially receive a wave of notifications whenever a single host becomes unavailable, alerting you to the host’s state and the service’s state. By removing host from the alert dimensions, you’ll group notifications at the service level and reduce the number of alerts you receive. This notification grouping allows you to have granular queries that still look at service- and host-level information but reduces the noise to a minimum by alerting once when the service meets the threshold. This enables you to investigate without being flooded with redundant alerts.

Illustration for notification grouping. One notification is sent when an entire service meets the alert threshold, regardless of the number of individual hosts that meet the threshold.
By grouping alert notifications at the service level, you can avoid receiving alerts from each individual host running the service experiencing an issue.

Leverage conditional variables

You can create conditional variables for your alerts that will modify their message or recipients based on defined criteria. Because conditional variables use advanced logic, you can utilize them to route notifications to specific teams. For example, you can ensure that any issue that arises in cluster_1 only notifies team_1, cluster_2 notifies team_2, and so on.

You can also increase an alert’s relevance by adding notification messages to your conditional variables. It can be frustrating to receive an alert and have to comb through a dozen or more conditional variables to figure out how to resolve a problem. Notification messages reduce alert fatigue because you and your team know that when you receive an alert, the message will provide you with information directly related to the problem. In the example below, we’ve configured the alert to look at the name of the cluster that emitted the alert. If it’s cluster_1, it displays a specific message, and notifies the development team that supports cluster_1.

{{#is_exact_match "" "cluster_1"}}
    <This displays if the cluster that triggered the alert is 
    exactly named cluster1.> <>

Schedule downtimes

If you and your team are planning any maintenance, upgrades, or system shutdowns, you can schedule downtimes so that all alerts and notifications are appropriately silenced.

Scheduling downtimes in the Datadog application

Downtimes work well for repetitive, predictable alerts because they are flexible—they don’t have to be all or nothing. You can schedule downtimes as one-time or recurring events and choose whether to silence all alerts or specific alerts.

Reduce alert fatigue with Datadog

Datadog’s out-of-the-box Monitor Notifications Overview dashboard displays your noisiest alerts and provides a breakdown of alert trends, enabling you to compare your current trends with historical data. The Monitor Notifications Overview dashboard also shows a live summary of your current alerts. If preferred, you can create custom dashboards to focus on what is most helpful to you, like a specialized dashboard that showcases all alerts that notify via PagerDuty. Our integrations with services like Jira, ServiceNow, and Slack, as well as tools like Datadog Case Management, will help you communicate efficiently as you track, triage, and troubleshoot issues.

View of Datadog's preconfigured Monitor Notifications Overview dashboard

Datadog provides extensive options for alert conditions so you can precisely craft combination alerts—which we call composite monitors—that can be tailored to your specific use cases and data sources, reducing the probability of alert fatigue. For example, you can set an alert that only triggers when the error rate for your web application is above 3 percent and there have been more than 100 hits on the backend in the past 30 minutes. A composite monitor will not alert your team if the error rate slightly rises above 3 percent due to a sudden surge in user activity, as long as the backend load stays below 100. Our anomaly monitors also offer features like automatic thresholding and adaptive alerting.

With Datadog, it’s easy to update your evaluation windows and frequencies. For example, if your evaluation window is less than 24 hours, the evaluation frequency of your alert will remain at once per minute. This allows you to extend your evaluation window freely without the risk of receiving delayed alerts. Use this information to determine how frequently your data needs to be evaluated. With Datadog, if the evaluation window is set to over 24 hours, the data will be checked every 10 minutes—or every 30 minutes if the window is set over 48 hours.

You can use template variables to make your notifications more informative and add custom messages to your conditional variables to make it easier for your team to pinpoint the root cause of issues. Assigning tags to your alerts can simplify notification grouping and are especially useful when you’re dealing with multiple teams or large organizations because you can filter, aggregate, and compare your telemetry in Datadog. If you set up our recovery thresholds for flappy alerts from both WARN and ALERT states, you’ll increase confidence that the alerting metric has successfully recovered and that the issue is resolved without the need for human intervention.

Datadog can also proactively mute alerts on your behalf. Specifically, if you shut down any of your resources, Datadog automatically mutes any alerts related to them so that you won’t receive any unnecessary notifications. For example, Datadog will auto-mute alerts related to your Azure VMs, Amazon EC2 instances, or Google Compute Engine (GCE) instances when their respective autoscaling services terminate the resources. Furthermore, Datadog enables you to choose what alerts you want to mute at any time, create a one-time or recurring downtime schedule, and notify your team of upcoming downtimes and status changes.

Solve your alert fatigue

In this post, we’ve defined alert fatigue, explained why it’s a problem, and confirmed that the cause is often a monitoring strategy that requires a more defined continuous improvement methodology. Reviewing your alerts regularly can reduce alert fatigue and prevent it from recurring in the future. We’ve also walked through how to find your noisiest alerts, the steps you can take to address them, and how you can use Datadog to prevent alert fatigue.

If you’re new to Datadog and want us to be a part of your monitoring strategy, you can get started now with a .