Datadog alerts are commonly used to identify dips, spikes, or unhealthy trends in your metrics—for example to detect memory rapidly running out, or a dramatic drop in requested work. Alerts are built to automatically notify the right people via the communication tools your teams use, such as PagerDuty or Slack.
But what if you want to be alerted when a specific event occurs?
- A deploy failed
- A job did not run correctly
- Batch data processing did not successfully complete within a certain time
- A third-party service announces an outage
- A new security group was created
- A customer converted
With Datadog’s new event-based alerts, you can trigger alerts on these types of events and more, exactly like you would on metrics or service checks.
This new feature works with any integration that sends events to Datadog. The screenshot below shows Datadog alerting on an unexpected crawler failure:
Flexible and precise event detection
You want to be alerted only for events that really matter to you and your teams. So we spent several months beta testing with dozens of Datadog customers to ensure that event alert definitions are flexible, precise, and easy to understand. You can combine the following filters to select very specific events to alert on:
- String matching: search for any substring in your events’ title, message, comments, users, and so on
- Status: error, warning, info, or success
- Priority of the event: normal or low
- Source: hosts, applications, custom events
- Tags: include or exclude specific integrations, hosts, or other scopes
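To make the way these filters combine more concrete, here is a minimal Python sketch. It is not Datadog's implementation or API: the `Event` class, the `matches` function, and the sample events are all hypothetical, and the sketch simply assumes that an event must pass every filter you specify.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, simplified event record (not Datadog's schema)."""
    title: str
    message: str = ""
    status: str = "info"        # error, warning, info, or success
    priority: str = "normal"    # normal or low
    source: str = ""
    tags: frozenset = frozenset()


def matches(event, text=None, status=None, priority=None, source=None,
            include_tags=(), exclude_tags=()):
    """Return True only if the event passes every filter that was given."""
    haystack = f"{event.title} {event.message}".lower()
    if text is not None and text.lower() not in haystack:
        return False            # string matching on title and message
    if status is not None and event.status != status:
        return False
    if priority is not None and event.priority != priority:
        return False
    if source is not None and event.source != source:
        return False
    if not set(include_tags) <= event.tags:
        return False            # a required tag is missing
    if set(exclude_tags) & event.tags:
        return False            # an excluded tag is present
    return True


# Made-up sample events for illustration only.
events = [
    Event("Chef run failed", "node web-01", "error", "normal", "chef",
          frozenset({"env:prod"})),
    Event("Chef run failed", "node web-02", "error", "normal", "chef",
          frozenset({"env:staging"})),
    Event("Deploy finished", "", "success", "low", "jenkins",
          frozenset({"env:prod"})),
]

selected = [
    e for e in events
    if matches(e, text="failed", status="error", source="chef",
               exclude_tags=["env:staging"])
]
```

With these sample events, only the production Chef failure ends up in `selected`; the staging failure is dropped by the tag exclusion and the deploy event by the status and source filters.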
Then you can customize the conditions that will trigger the event alerts:
- Aggregations let you alert, for instance, if more than 10 errors have been reported within the last two hours
- Absence lets you send a notification when an expected event is missing, for example when the event reported after each successful run of a critical job hasn’t appeared in more than 30 minutes
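The two trigger conditions above can be sketched in a few lines of Python. This is an illustration of the logic only, under the assumption that each matching event carries a timestamp; the function names `count_exceeds` and `is_absent` and the sample timestamps are hypothetical and are not part of Datadog.

```python
from datetime import datetime, timedelta


def count_exceeds(timestamps, now, window, threshold):
    """Aggregation: True if more than `threshold` events fall in the
    window of length `window` ending at `now`."""
    recent = [t for t in timestamps if now - window <= t <= now]
    return len(recent) > threshold


def is_absent(timestamps, now, max_gap):
    """Absence: True if no event was reported in the last `max_gap`."""
    return not any(now - max_gap <= t <= now for t in timestamps)


# Made-up timestamps for illustration only.
now = datetime(2024, 1, 1, 12, 0)

# Eleven error events, one every 10 minutes, the oldest 100 minutes ago.
error_times = [now - timedelta(minutes=10 * i) for i in range(11)]

# The last "job succeeded" event arrived 45 minutes ago.
success_times = [now - timedelta(minutes=45)]

# More than 10 errors within the last two hours?
too_many_errors = count_exceeds(error_times, now, timedelta(hours=2), 10)

# No success event in more than 30 minutes?
job_missing = is_absent(success_times, now, timedelta(minutes=30))
```

Both conditions evaluate to true for this sample data: there are 11 errors inside the two-hour window, and the most recent success event is older than 30 minutes.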
Set up alerting that triggers based on the occurrence of specific events with Datadog.
All the power of Datadog alerts
All Datadog alerting features now work with events too. Below is an example of how to use event-based alerts to know if Chef fails too frequently:
You can see a histogram and a stream with all the matching events over the selected time window. Here are the conditions used to define this event alert: