Incident Management with Datadog

Author: Mary Jac Heuman

Published: August 11, 2020

When your application experiences an outage, the tools your team uses to manage its response can make all the difference in how quickly the problem gets resolved and how well you avoid it in the future. An effective incident management workflow depends on accessible, integrated tools as well as clear, direct channels of communication. And even after the matter has been resolved, documentation and analysis of an outage are vital to keeping it from happening again. Many incident management apps make the response process more difficult: each new tool comes with a learning curve, which hinders coordination.

That’s why we are excited to announce Datadog Incident Management. Now your teams can create and track incidents within Datadog and collaborate while troubleshooting, reducing mean time to resolution (MTTR). With the new centralized Incidents UI, along with new and enhanced features including the Datadog mobile app, Datadog Slack app, and collaborative Notebooks, you can move from triaging possible issues, to investigating the root cause, to resolving and documenting the problem, all within Datadog.
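To make the MTTR metric concrete, here is a minimal sketch of how it could be computed from the times incidents were declared and resolved. The timestamps are made-up illustration data, and `mean_time_to_resolution` is a hypothetical helper written for this example, not a Datadog API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - declared) across resolved incidents.

    `incidents` is a list of (declared_at, resolved_at) pairs; incidents
    that are still open have resolved_at set to None and are skipped.
    Returns None when no incident has been resolved yet.
    """
    durations = [
        resolved - declared
        for declared, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Illustrative data only (assumed, not taken from any real system).
incidents = [
    (datetime(2020, 8, 1, 9, 0), datetime(2020, 8, 1, 9, 45)),    # 45 minutes
    (datetime(2020, 8, 3, 14, 0), datetime(2020, 8, 3, 15, 30)),  # 90 minutes
    (datetime(2020, 8, 5, 22, 0), None),                          # still active
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:07:30
```

Only resolved incidents count toward the average in this sketch; whether to include open incidents is a reporting choice your team would need to make.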

Sounding the alarms

The first steps of any incident management workflow are triaging an issue and, if you determine that it needs a full response, notifying the right people. The new Datadog mobile app makes on-call life easier by providing easy access to all your Datadog dashboards and monitors, so that once you receive a page you can investigate the offending alert from anywhere.

Mobile app alerts and monitors

Datadog already lets you share graphs and notifications across your organization through Slack. Our new Slack app makes this even easier with commands to share information directly from Datadog without leaving the chat window. For instance, to share a graph that gives context to a monitor alert, use /datadog dashboard to select and post a dashboard widget.

Once your team agrees it’s time to escalate, you can begin your response directly from Slack with the /datadog incident shortcut. This creates an incident and lets you tag it with relevant information like severity, whether or not there’s impact to customers or billing, and what environments have been affected.

You can also declare an incident inside Datadog from any dashboard graph or by going to the new Incidents UI. From there you can assign an incident commander and send notifications to people, Slack channels, and other services like PagerDuty or Opsgenie.

Unified incident response

The Datadog Incidents UI provides a central view of all incidents, both active and resolved. You can filter and sort incidents by key metadata such as team, severity, status, and any other information you tagged. Selecting an incident brings you to a timeline containing a chronological list of updates to the issue: for instance, tag updates such as a change in the incident’s status from stable to resolved, or newly added tasks. Team members can contribute links or text to the timeline to provide commentary, context, and other helpful information. For example, anyone can add widgets from dashboards within Datadog that show relevant metrics.

Incident timeline UI
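The filtering, sorting, and timeline behavior described above can be pictured with a small data model. This is only a sketch of the idea, not how Datadog implements it: the `Incident` class, its field names, `filter_incidents`, and the sample records are all hypothetical, and the severity labels are assumed to sort correctly as plain strings.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Incident:
    title: str
    team: str
    severity: str  # assumed labels such as "SEV-1" (most severe) to "SEV-5"
    status: str    # assumed values: "active", "stable", "resolved"
    timeline: list = field(default_factory=list)  # (timestamp, text) entries

    def add_update(self, when, text):
        """Record an update and keep the timeline in chronological order."""
        self.timeline.append((when, text))
        self.timeline.sort(key=lambda entry: entry[0])


def filter_incidents(incidents, **criteria):
    """Return incidents whose attributes match every given criterion."""
    return [
        incident
        for incident in incidents
        if all(getattr(incident, key) == value for key, value in criteria.items())
    ]


# Hypothetical sample data.
checkout = Incident("Checkout latency", team="payments", severity="SEV-2", status="stable")
login = Incident("Login errors", team="identity", severity="SEV-1", status="active")
search = Incident("Search timeouts", team="payments", severity="SEV-3", status="resolved")
all_incidents = [checkout, login, search]

# Filter by team, then sort by severity (single-digit labels sort as strings).
payments = filter_incidents(all_incidents, team="payments")
by_severity = sorted(all_incidents, key=lambda incident: incident.severity)

# Updates added out of order still read chronologically.
checkout.add_update(datetime(2020, 8, 11, 10, 30), "Status changed from stable to resolved")
checkout.add_update(datetime(2020, 8, 11, 10, 5), "Added dashboard widget showing p99 latency")

print([incident.title for incident in payments])
print([incident.title for incident in by_severity])
print([text for _, text in checkout.timeline])
```

Keeping the timeline sorted on every insert is the simplest way to guarantee a chronological view here; a real system would more likely order entries at query time.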

Making the most of post-incident reviews

Resolving an outage is important, but it’s just as important to analyze the root cause and take steps to ensure it doesn’t happen again. Our incident response workflow has built-in tools for collaborative documentation so you can learn from the outages you’ve faced. Each incident has a remediation view, where you can create and track post-incident tasks as well as link postmortem documents and Datadog Notebooks.

Our revamped Notebooks now support real-time collaborative editing, so your team can work together to document the incident response and investigation process using data-driven storytelling. For instance, you can add interactive metric graphs as a visual aid. Graphs in notebooks support all Datadog data sources and can be independently scoped to specific time ranges, so you can visualize an exact point during the incident. Full Markdown support also enables you to add rich context, like code snippets detailing how to resolve an issue. If the issue occurs again, you’ll have a full record of the steps you previously took to resolve it.

Notebooks now with real-time collaborative editing

Get started today

Datadog’s new incident management platform, now in public beta, provides a streamlined set of features for responding to outages that’s fully integrated into the monitoring platform you already use. We’re working on adding even more features, enhancements, and integrations in the future. If you’re a Datadog customer, you can try out the Incidents UI today and request access to the Slack app private beta. If you’re new to Datadog, sign up for a free trial.

For more information on using Datadog Incident Management, reach out to sales@datadoghq.com or your sales representative.