Automate incident response workflows with Eventarc and Datadog

Authors: Thomas Sobolik, Rachel Groberman

Published: August 2, 2022

Eventarc is a Google Cloud offering that ingests and routes events between GCP products, such as Cloud Run, Cloud Functions, and Pub/Sub, making it easy to build automated, event-driven workflows in complex environments. By taking care of event ingestion, delivery, authorization, and error handling, Eventarc reduces the development overhead that is required to build and maintain these workflows and helps you improve application resilience.

We’re pleased to announce the launch of a Datadog source in Eventarc, which allows Datadog and Eventarc customers to configure any Datadog monitor to kick off Eventarc-driven workflows. In this post, we’ll discuss how our Eventarc integration can help teams auto-remediate issues, quickly gather contextual data for incident response, and analyze historical trends in their triggered alerts.

Trigger auto-remediation workflows

Datadog’s Eventarc integration enables customers to connect Datadog monitors to Eventarc triggers, which can be used to kick off complex workflows that use custom combinations of GCP products, such as Cloud Run services, Cloud Workflows, Google Kubernetes Engine services, and Cloud Functions. These workflows can be configured to perform auto-remediation steps in response to critical issues flagged by alerts, significantly reducing your MTTR.

For example, let’s say you manage a service that processes customer payments in a web application. To control your cloud costs, the service uses a quota to limit the number of new jobs it handles each hour. Even so, you’ll want a monitor that alerts you when that limit is reached, because quota exhaustion errors can degrade performance.

Figure: Configuring a monitor to trigger a GCP workflow using Eventarc

By using this alert to activate a trigger in Eventarc, you can kick off a workflow to automatically remediate the problem. This workflow might first notify responders (via Slack, email, or PagerDuty) about the quota exhaustion and request approval for a quota increase. Upon receiving approval, the workflow might then invoke a Cloud Function to temporarily raise the quota. Finally, the workflow could send another notification to the on-call team to confirm that the remediation has been completed.
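The sequence described here (notify, request approval, raise the quota, confirm) can be sketched in a few lines of Python. This is a minimal illustration, not Datadog's or Google's implementation: the `QuotaAlert` type, the `remediate_quota` function, and the three injected callables are all hypothetical names, standing in for whatever notification channel, approval step, and Cloud Function your workflow actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuotaAlert:
    """Hypothetical, simplified view of a triggered quota monitor."""
    service: str
    current_quota: int


def remediate_quota(
    alert: QuotaAlert,
    notify: Callable[[str], None],
    request_approval: Callable[[str], bool],
    raise_quota: Callable[[str, int], None],
    increase_by: int = 100,
) -> bool:
    """Run the notify -> approve -> raise -> confirm sequence.

    Returns True if the quota was raised, False if approval was denied.
    All three callables are assumptions supplied by the caller.
    """
    # Step 1: tell responders what happened.
    notify(f"Quota exhausted for {alert.service} (limit {alert.current_quota}).")

    # Step 2: stop early if nobody approves the increase.
    if not request_approval(f"Raise quota for {alert.service} by {increase_by}?"):
        notify(f"Quota increase for {alert.service} was not approved.")
        return False

    # Step 3: apply the temporary increase.
    new_quota = alert.current_quota + increase_by
    raise_quota(alert.service, new_quota)

    # Step 4: confirm the remediation to the on-call team.
    notify(f"Quota for {alert.service} temporarily raised to {new_quota}.")
    return True
```

Passing the side effects in as callables keeps the control flow easy to test without touching Slack, PagerDuty, or GCP; in a real deployment each callable would wrap the corresponding service call.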

Gather context for incident response

In addition to automating incident response workflows, Eventarc can also be used to collate metrics, logs, and important metadata (such as transaction IDs and container names) and append them to your incident tickets. By automatically gathering this contextual data in response to a triggered Datadog alert, Eventarc can help on-call engineers minimize context switching and reduce their MTTR.

For example, let’s say our payment service from the previous section has been compromised by attackers. You’ve configured the Datadog monitor that alerted you to the attack to kick off a workflow that prepares data for your on-call team’s response. The workflow invokes a Cloud Function that gathers audit logs, recent requester IPs, and threat intelligence data and writes this information to a Google Cloud Storage bucket. The workflow then appends a link to this bucket to the newly created incident ticket, and pings responders to let them know what data has been successfully made available. As a result, incident responders get this important context automatically and can ultimately remediate the problem more quickly.
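As a rough sketch of the gathering step, the function below runs a set of collectors and serializes their results into a single JSON document that a Cloud Function could then write to a bucket. Everything here is illustrative: `build_context_bundle`, `bundle_location`, the collector names, and the object path layout are assumptions, and the actual upload and ticket update are left out.

```python
import json
from typing import Callable, Dict, List


def build_context_bundle(
    incident_id: str,
    collectors: Dict[str, Callable[[], List[str]]],
) -> str:
    """Run each collector and serialize the results as one JSON document."""
    bundle = {"incident_id": incident_id, "context": {}}
    for name, collect in collectors.items():
        try:
            bundle["context"][name] = collect()
        except Exception as exc:
            # One failing source should not block the rest of the bundle.
            bundle["context"][name] = {"error": str(exc)}
    return json.dumps(bundle, indent=2, sort_keys=True)


def bundle_location(bucket: str, incident_id: str) -> str:
    """Hypothetical object path to link from the incident ticket."""
    return f"gs://{bucket}/incidents/{incident_id}/context.json"
```

Recording a collector's failure inside the bundle, rather than aborting, means responders still receive partial context and can see which source was unavailable.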

Perform alerting analytics

Datadog’s Eventarc integration can also be used to configure an analytics workflow in GCP that continually logs and processes alert data as your monitors are triggered. This ability to monitor alert activity enables you to surface historical trends and gain insights into the overall health, availability, and performance of your services.

For example, let’s say you want to use alert data to analyze the availability of the payment service we’ve been discussing. For each monitor that tracks when an endpoint on the service stops handling requests, you can include an Eventarc trigger that starts a workflow. The workflow takes data from the alert (such as the time it first fired, its total duration, the relevant endpoint ID, and the related HTTP error code) and writes it to a table in BigQuery. The workflow will then trigger a custom Cloud Function to run an analytics job on the BigQuery table that looks for trends, such as specific endpoints that are failing more often than others. Finally, the workflow will ping the service owner on a routine basis with a report describing the analysis.
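To make the trend analysis concrete, here is a small in-memory sketch of the "which endpoints fail most often" question. In the workflow described above this would more likely be a query against the BigQuery table; the row shape (an `endpoint_id` key per alert) and the `failing_endpoints` function are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failing_endpoints(
    alerts: Iterable[Dict[str, object]],
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Return (endpoint_id, alert_count) pairs, most frequent first.

    Only endpoints with at least `min_count` alerts are reported.
    """
    counts = Counter(str(alert["endpoint_id"]) for alert in alerts)
    return [(endpoint, n) for endpoint, n in counts.most_common() if n >= min_count]
```

For instance, given three alerts for a hypothetical `checkout` endpoint and one for `refund`, the function would report only `checkout`, which is the kind of signal the routine report to the service owner could surface.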

Get started with Eventarc and Datadog

Datadog’s Eventarc integration enables you to automate incident response and analytics processes by executing choreographed workflows in response to triggered monitors. This integration is now available in public preview for Datadog and GCP customers; see GCP’s dedicated codelab tutorials for more detailed information on getting started. If you’re brand new to Datadog, sign up for a free trial to get started.