The Monitor

How we use Datadog to get comprehensive, fine-grained visibility into our email delivery system

Alexa Liaskovski

Aaron Kaplan

Visibility into email performance is indispensable to any organization that counts on its ability to reach people through their inboxes, including Datadog. SREs, FinOps, and many other teams rely on email as a critical channel for communications from our platform, including monitor alerts, usage reports, and service account notifications. At Datadog, we depend on the visibility provided by our integrations for Mailgun, SendGrid, and Amazon SES to optimize our email performance and ensure deliverability.

In this post, we'll take a close look at how we use these integrations internally at Datadog to monitor the delivery of every email going through our app. In particular, we'll explore the custom metrics we use to augment the out-of-the-box (OOTB) visibility provided by these integrations in order to closely analyze email performance and maintain the health of one of our platform's vital lines of communication.

Creating cross-transport visibility for comprehensive delivery tracking

Our integrations for SendGrid, Mailgun, and Amazon SES use webhooks from each transport to collect event data and verify successful email delivery. In the email delivery life cycle, these events begin when messages are added to a transport's sending queue. From there, they cover a range of eventualities up to and including successful delivery. The key events we track are:

  • Bounces, in which delivery attempts are rejected by receiving servers
  • Drops, in which transports forgo delivery attempts based on previous issues with receiving servers
  • Deferrals, or soft bounces, in which delivery temporarily fails and emails are added back into queues

All of this seems simple enough in principle. In practice, however, the data gets complicated, and we rely heavily on Datadog Log Management to create consistency and eliminate hurdles for analytics and troubleshooting. For example, these transports use inconsistent terminology in their logging of delivery events: SendGrid logs label bounce events Bounced with a type of Blocked, whereas Mailgun labels them Failed, with a reason of Suppress-Bounce. Our integrations for these transports include OOTB logs pipelines that do some massaging of log data upon intake for improved consistency. But we also use Log Management to standardize and enrich these logs, which helps us get a cohesive picture of email delivery patterns across all of our transports. Here's how we standardize transport event names:

| SendGrid event names | Mailgun event names | Amazon SES event names | Standardized event names |
|---|---|---|---|
| @evt.name:processed | @evt.name:accepted | @evt.name:send | @evt.name:accepted |
| @evt.name:delivered | @evt.name:delivered | @evt.name:delivery | @evt.name:delivered |
| @evt.name:dropped | @evt.name:failed and @reason:suppress-bounce | @evt.name:deliverydelay | @evt.name:dropped |
| @evt.name:bounced | @evt.name:failed and @event-data.severity:permanent | @evt.name:bounce | @evt.name:bounced |
| @evt.name:deferred | @evt.name:failed and @event-data.severity:temporary | @evt.name:deliverydelay | @evt.name:deferred |
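
To make the mapping concrete, here is a minimal Python sketch of the same standardization logic. It is an illustration only, not our pipeline configuration: the function name `standardize_event`, the transport keys, and the argument names are hypothetical. The table lists `deliverydelay` for Amazon SES under both dropped and deferred without a distinguishing attribute, so this sketch assumes `deferred` for that event.

```python
from typing import Optional

# One-to-one mappings taken from the table above.
# Assumption: SES "deliverydelay" is treated as "deferred", because the
# table gives no attribute that separates it from "dropped".
SIMPLE_MAP = {
    "sendgrid": {
        "processed": "accepted",
        "delivered": "delivered",
        "dropped": "dropped",
        "bounced": "bounced",
        "deferred": "deferred",
    },
    "mailgun": {"accepted": "accepted", "delivered": "delivered"},
    "ses": {
        "send": "accepted",
        "delivery": "delivered",
        "bounce": "bounced",
        "deliverydelay": "deferred",
    },
}


def standardize_event(
    transport: str,
    name: str,
    reason: Optional[str] = None,
    severity: Optional[str] = None,
) -> Optional[str]:
    """Return the standardized event name, or None if no rule applies."""
    transport, name = transport.lower(), name.lower()
    # Mailgun reports several outcomes as "failed"; the reason and
    # severity attributes tell them apart.
    if transport == "mailgun" and name == "failed":
        if reason == "suppress-bounce":
            return "dropped"
        if severity == "permanent":
            return "bounced"
        if severity == "temporary":
            return "deferred"
        return None
    return SIMPLE_MAP.get(transport, {}).get(name)
```

For example, `standardize_event("mailgun", "failed", severity="temporary")` returns `"deferred"`, while an unrecognized event returns `None` so that it can be flagged rather than silently mislabeled.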

We've also customized the OOTB logs pipelines for our email transports. For example:

  • We add warning and error statuses to deferred and dropped events, respectively.
  • We measure the delivery lifetime of each message by calculating the difference between email queuing and delivery times.
  • We extract domains from recipient email addresses in order to tag and group metrics by domain.

An overview of our internal logs pipeline for SendGrid.
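
The three customizations listed above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the pipeline itself: the field names (`name`, `recipient`, `queued_at`, `delivered_at`), the `enrich` function, and the default `info` status are all hypothetical.

```python
from datetime import datetime

# Deferred events become warnings and dropped events become errors.
STATUS_BY_EVENT = {"deferred": "warning", "dropped": "error"}


def enrich(event: dict) -> dict:
    """Return a copy of a standardized event with extra attributes."""
    enriched = dict(event)

    # 1. Status remapping (assumption: everything else stays "info").
    enriched["status"] = STATUS_BY_EVENT.get(event["name"], "info")

    # 2. Delivery lifetime: delivery time minus queuing time, in seconds.
    if (
        event["name"] == "delivered"
        and "queued_at" in event
        and "delivered_at" in event
    ):
        queued = datetime.fromisoformat(event["queued_at"])
        delivered = datetime.fromisoformat(event["delivered_at"])
        enriched["lifetime_s"] = (delivered - queued).total_seconds()

    # 3. Recipient domain, lowercased so it can be used as a tag.
    recipient = event.get("recipient", "")
    if "@" in recipient:
        enriched["recipient_domain"] = recipient.rsplit("@", 1)[1].lower()

    return enriched
```

Splitting on the last `@` keeps the sketch tolerant of unusual local parts, and lowercasing the domain avoids creating separate tag values for `Example.com` and `example.com`.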

We also use Grok Parsers to analyze the reasons for bounces logged by our transports. Inconsistent reason values lead to inconsistent tagging, so we use Grok Parsers to extract data such as SMTP codes and to remap common error messages to consistent values. This gives us a clearer picture of delivery issues at scale.

A sample Grok Parser from our logs pipeline for SendGrid.
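
For readers who don't use Grok, the same idea can be approximated with regular expressions. The Python sketch below is not our Grok rule: the sample reason strings, the category labels, and the `parse_bounce_reason` function are hypothetical, chosen only to show how an SMTP reply code and an enhanced status code might be pulled out of a free-text reason and how common messages might be remapped to one label.

```python
import re

# Three-digit SMTP reply code (2xx, 4xx, or 5xx), not part of a longer number.
SMTP_CODE = re.compile(r"(?<![\d.])([245]\d{2})(?![\d.])")
# Enhanced status code of the form class.subject.detail, e.g. 5.1.1.
ENHANCED_CODE = re.compile(r"(?<![\d.])([245]\.\d{1,3}\.\d{1,3})(?![\d.])")

# Hypothetical remapping of common error messages to consistent labels.
CATEGORY_PATTERNS = [
    (re.compile(r"does not exist|user unknown|no such user", re.I), "invalid_recipient"),
    (re.compile(r"mailbox full|over quota", re.I), "mailbox_full"),
    (re.compile(r"spam|blocked|blacklist", re.I), "blocked"),
]


def parse_bounce_reason(reason: str) -> dict:
    """Extract codes and a normalized category from a bounce reason."""
    smtp = SMTP_CODE.search(reason)
    enhanced = ENHANCED_CODE.search(reason)
    category = "other"
    for pattern, label in CATEGORY_PATTERNS:
        if pattern.search(reason):
            category = label
            break
    return {
        "smtp_code": smtp.group(1) if smtp else None,
        "enhanced_code": enhanced.group(1) if enhanced else None,
        "category": category,
    }
```

Tagging on the extracted code and normalized category, rather than on the raw reason text, is what keeps differently worded messages from the same failure mode in one group.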

To facilitate troubleshooting of these issues, we use Saved Views that let our support team quickly search logs for outgoing emails. This way, support engineers can jump straight into targeted troubleshooting when a customer reports that an email they were expecting from Datadog has not reached their inbox.

One of our Saved Views for troubleshooting email delivery issues.

Refining our visibility into email delivery via custom metrics

Enforcing consistency in our email transport logs has helped us create a strong foundation for targeted troubleshooting and analysis. It's also helped us monitor patterns in email delivery at the right level of granularity. Each of our transports generates aggregate metrics that are collected by our integrations. By using Log Management to generate our own custom metrics, which cover the delivery of all emails from our platform, we've gained finer granularity and enriched our cross-transport visibility. We use our standardized transport logs to generate the following metrics:

| Metric Name | Type | Description |
|---|---|---|
| email_outgoing.event.accepted | Count | Email accepted by transport for delivery. |
| email_outgoing.event.all | Count | Total count of all email events. |
| email_outgoing.event.bounced | Count | Email bounced by recipient server. |
| email_outgoing.event.clicked | Count | Link within email clicked. |
| email_outgoing.event.deferred | Count | Email delivery failed, reattempt pending. |
| email_outgoing.event.delivered | Count | Email successfully delivered. |
| email_outgoing.event.dropped | Count | Email dropped based on transport's built-in suppression list. |
| email_outgoing.event.opened | Count | Email opened by recipient. |
| email_outgoing.lifetime.bounced | Distribution | Length of time between email queuing and bounce. |
| email_outgoing.lifetime.delivered | Distribution | Length of time between email queuing and delivery. |
| email_outgoing.lifetime.deferred | Distribution | Length of time an email has been deferred. |
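
Conceptually, each count metric tallies standardized events by name, and each distribution metric collects lifetime values for one kind of event. The Python sketch below imitates that aggregation in memory. It only illustrates the idea and is not how log-based metrics are computed in Datadog; the `aggregate` function and the `name` and `lifetime_s` fields are hypothetical.

```python
from collections import Counter, defaultdict

# Event names that have a matching lifetime distribution in the table above.
LIFETIME_EVENTS = ("bounced", "delivered", "deferred")


def aggregate(events):
    """Build count and distribution samples from standardized events."""
    counts = Counter()
    lifetimes = defaultdict(list)
    for event in events:
        name = event["name"]
        counts["email_outgoing.event.all"] += 1
        counts[f"email_outgoing.event.{name}"] += 1
        # Only events that carry a lifetime value feed a distribution.
        if name in LIFETIME_EVENTS and "lifetime_s" in event:
            lifetimes[f"email_outgoing.lifetime.{name}"].append(event["lifetime_s"])
    return counts, lifetimes
```

Because the inputs are already standardized, a single metric name such as `email_outgoing.event.bounced` covers bounces from every transport.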

Here, you can see how these metrics are defined in our UI:

Our custom metrics for tracking email delivery, seen within the Datadog Log Management UI

We track these metrics in a centralized dashboard for clear and detailed visibility into delivery patterns, such as:

  • Our total overall bounce and drop rates
  • The rates and volumes of emails delivered and dropped, by message type (e.g., monitor alerts, daily and weekly digests) and recipient domain
  • Latencies for deferred messages, by message type and recipient domain

Our internal dashboard for monitoring email delivery.
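
As a rough illustration of the kind of rate such a dashboard might show, the Python sketch below computes bounce and drop rates per recipient domain. The denominator is an assumption made for this example: each rate is taken as a share of messages that reached a terminal outcome (delivered, bounced, or dropped). The `rates_by_domain` function and the `name` and `domain` fields are hypothetical.

```python
from collections import defaultdict

TERMINAL_EVENTS = ("delivered", "bounced", "dropped")


def rates_by_domain(events):
    """Return bounce and drop rates per recipient domain."""
    totals = defaultdict(lambda: dict.fromkeys(TERMINAL_EVENTS, 0))
    for event in events:
        if event["name"] in TERMINAL_EVENTS:
            totals[event["domain"]][event["name"]] += 1

    rates = {}
    for domain, tally in totals.items():
        # Assumption: rates are relative to all terminal outcomes.
        # A domain only appears here once it has at least one of them.
        terminal = sum(tally.values())
        rates[domain] = {
            "bounce_rate": tally["bounced"] / terminal,
            "drop_rate": tally["dropped"] / terminal,
        }
    return rates
```

Grouping by domain makes it easier to see whether a spike in bounces is spread across all recipients or concentrated at one mail provider.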

Tracking the number of emails bounced by SMTP code is particularly useful for helping us understand what's driving issues with delivery.

Detail of our internal dashboard for monitoring email delivery showing bounce data

Ensuring a vital line of communication with our customers

At Datadog, our integrations for Mailgun, SendGrid, and Amazon SES have enabled us to create fine-grained, cross-vendor visibility into one of our platform's vital lines of communication, helping us ensure the timely delivery of everything from monitor alerts to usage reports. Learn more about our email transport integrations.
