Triggered Alerts in Datadog - Providing Context to Alerts

Author Alexis Lê-Quôc

Published: September 10, 2013

Have you ever received an alert on your phone or in your email that left you wondering exactly what was wrong and how critical the issue was with respect to the rest of the application?

Typical alert with no useful information

In this example, the alert says that a service is in a warning state but does not tell you whether the issue is picking up momentum or is just a slow march to failure.

Without context, it’s difficult to gauge whether the issue is a real problem that you must immediately work on and fix, or simply something that you should acknowledge and investigate later.

Ultimately, when analyzing an issue, the following questions must be answered:

  • Is this a new or an old issue?
  • Is the alert indicative of a transient or recurring problem?
  • If the issue is recurring, is the time frame between alerts changing?
  • How far “off baseline” are the performance metrics that triggered the alert?
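
As a rough illustration, these four questions can be answered from an alert's trigger history plus a sample of the metric under normal conditions. The Python sketch below is not Datadog's implementation: the function name `summarize_alert`, its parameters, and the one-hour threshold for calling an alert "new" are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def summarize_alert(trigger_times, baseline_values, current_value, now,
                    new_within=timedelta(hours=1)):
    """Answer the four triage questions from raw history.

    All inputs are hypothetical: trigger_times is a list of datetimes at
    which the alert fired, baseline_values is a sample of the metric
    under normal conditions.
    """
    if not trigger_times:
        raise ValueError("need at least one trigger time")
    times = sorted(trigger_times)

    # New vs. old: did the alert first fire recently?
    is_new = (now - times[0]) <= new_within

    # Transient vs. recurring: more than one firing means it came back.
    is_recurring = len(times) > 1

    # Gaps between consecutive firings, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) >= 2:
        # Shrinking gaps suggest the problem is picking up momentum.
        gap_trend = "shrinking" if gaps[-1] < gaps[0] else "stable or growing"
    else:
        gap_trend = "unknown"

    # Distance from baseline, measured in standard deviations.
    spread = pstdev(baseline_values)
    if spread > 0:
        sigmas = (current_value - mean(baseline_values)) / spread
    else:
        sigmas = None  # a flat baseline gives no usable scale

    return {
        "is_new": is_new,
        "is_recurring": is_recurring,
        "gap_trend": gap_trend,
        "sigmas_off_baseline": sigmas,
    }


report = summarize_alert(
    trigger_times=[
        datetime(2013, 9, 10, 9, 0),
        datetime(2013, 9, 10, 11, 0),
        datetime(2013, 9, 10, 11, 30),
    ],
    baseline_values=[10, 12, 8, 10],
    current_value=17,
    now=datetime(2013, 9, 10, 12, 0),
)
```

With the sample data, the alert first fired three hours ago, has fired three times, and the gap between firings has dropped from two hours to thirty minutes, which is the "picking up momentum" case described earlier.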

Datadog’s new Triggered Alerts screen puts this important contextual information one click away.

For starters, the screen shows only alerts whose alerting criteria are currently met. Because every alert listed is live, any investigation or fix has an immediate impact: the alerting condition ceases.

Clicking any triggered alert opens a modal drilldown screen with key information such as:

  1. Scope: Which specific servers are affected
  2. Underlying Metric: Which metric triggered the alert and how it has behaved
  3. History: How long the alert has been triggered, whether it has happened in the past, and how often
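
To make these three pieces of context concrete, here is a minimal sketch of how they might be held together in one record. It is a hypothetical container, not Datadog's actual data model: the class `AlertContext`, its field names, and the sample host and metric names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class AlertContext:
    """Hypothetical record of the context shown for one triggered alert."""
    hosts: List[str]             # scope: which servers are affected
    metric: str                  # underlying metric name
    recent_values: List[float]   # how the metric has behaved recently
    # History: one (start, end) pair per episode; end is None while the
    # alert is still triggered.
    episodes: List[Tuple[datetime, Optional[datetime]]] = field(default_factory=list)

    def current_duration(self, now: datetime) -> Optional[timedelta]:
        """How long the alert has been triggered, or None if it is not."""
        for start, end in self.episodes:
            if end is None:
                return now - start
        return None

    def past_occurrences(self) -> int:
        """How many times the alert triggered and resolved before."""
        return sum(1 for _, end in self.episodes if end is not None)


ctx = AlertContext(
    hosts=["web-1", "web-2"],    # example host names
    metric="system.load.1",      # example metric name
    recent_values=[1.0, 2.5, 4.0],
    episodes=[
        (datetime(2013, 9, 9, 8, 0), datetime(2013, 9, 9, 8, 20)),
        (datetime(2013, 9, 10, 11, 0), None),
    ],
)
```

Here `ctx.current_duration(datetime(2013, 9, 10, 12, 0))` gives one hour and `ctx.past_occurrences()` gives 1, covering the "how long" and "how often" parts of the history.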

Administrators can begin a routine sweep of their environment by opening the Triggered Alerts page and triaging the system conditions that are causing problems at that exact minute.

