When your team experiences an outage, the tools you use to respond can make all the difference in how quickly you resolve the problem and avoid it in the future. An effective incident management workflow depends on accessible, integrated tools as well as clear, direct channels of communication. And even after the matter’s been resolved, documentation and analysis of an outage are vital to preventing similar issues in the future. Often, piecing together all of the relevant information to create these post-incident documents is a manual and time-consuming process.
With Datadog Incident Management, your teams can easily create and track incidents within Datadog and collaborate while troubleshooting, reducing mean time to resolution (MTTR). You can declare an incident from different places across the app and manage it in the centralized Incidents UI. Together with the Datadog mobile app, the Datadog Slack App, collaborative Notebooks, and the cross-platform Clipboard, this lets you move seamlessly from triaging possible issues, to investigating the root cause, to resolving and documenting the problem.
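To make the MTTR figure concrete, here is a minimal sketch of how it could be computed from a list of incidents. It assumes MTTR is measured from the moment an incident is declared to the moment it is resolved (teams define the starting point differently), and the `incidents` records and the `mean_time_to_resolution` helper are hypothetical, not part of Datadog.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (declared_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45)),      # 45 minutes
    (datetime(2024, 1, 12, 14, 30), datetime(2024, 1, 12, 16, 0)),  # 90 minutes
    (datetime(2024, 1, 20, 22, 15), datetime(2024, 1, 20, 22, 30)), # 15 minutes
]


def mean_time_to_resolution(records):
    """Average the declared-to-resolved duration across resolved incidents."""
    if not records:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - declared for declared, resolved in records), timedelta())
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number over time is one way to check whether changes to your response workflow are actually shortening outages.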
Sounding the alarms
The first steps of any incident management workflow are triaging an issue and, if you determine that it needs a full response, notifying the right people. The Datadog mobile app makes on-call life easier by providing easy access to all your Datadog dashboards and monitors, so that once you receive a page you can investigate the offending alert from anywhere.
Datadog already lets you share graphs and notifications across your organization through Slack. The Datadog Slack App makes this even easier with commands to share information directly from Datadog without leaving the chat window. For instance, to share a graph that gives context to a monitor alert, use /datadog dashboard to select and post a dashboard widget.
Once your team agrees it’s time to escalate, you can begin your response directly from Slack with the /datadog incident shortcut. This creates an incident and lets you tag it with relevant information like severity, whether or not there’s impact to customers or billing, and what environments have been affected. You can manage your incidents from start to finish from within the Datadog Slack App with commands like /datadog incident list to see all open incidents, and /datadog incident update to add a title, assign severity, or update the status and resolve an ongoing issue.
You can also declare an incident inside of Datadog from any dashboard graph using the cross-platform Clipboard, or by going to the Incidents UI. From there you can assign an incident commander and send notifications to necessary responders and stakeholders directly in Slack channels or through other services like PagerDuty or OpsGenie.
Unified incident response
The Datadog Incidents UI provides a central view of all incidents, including both active and resolved. You can filter and sort incidents by key metadata such as team, severity, status, and other information you tagged. Selecting an incident brings you to a timeline containing a chronological list of updates to the issue; for instance, updated tags like a change in the incident’s status from stable to resolved, or tasks that have been added. Team members can contribute links or text to the timeline to provide commentary, context, and other helpful information. For example, anyone can add widgets from dashboards within Datadog that show relevant metrics.
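The filtering, sorting, and timeline behavior described above can be pictured with a small in-memory model. This is only an illustrative sketch: the `Incident` class, its field names, and the `filter_incidents` helper are hypothetical and do not reflect how Datadog stores or exposes incidents.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    title: str
    team: str
    severity: int  # 1 is the most severe in this sketch
    status: str    # "active", "stable", or "resolved"
    timeline: list = field(default_factory=list)  # (timestamp, note) pairs

    def add_update(self, when: datetime, note: str) -> None:
        # Keep the timeline chronological no matter what order updates arrive in.
        self.timeline.append((when, note))
        self.timeline.sort(key=lambda entry: entry[0])


def filter_incidents(incidents, **criteria):
    """Return incidents whose attributes equal every given keyword criterion."""
    return [
        incident for incident in incidents
        if all(getattr(incident, name) == value for name, value in criteria.items())
    ]


incidents = [
    Incident("Checkout latency", team="payments", severity=2, status="active"),
    Incident("Login errors", team="identity", severity=1, status="active"),
    Incident("Stale cache", team="payments", severity=3, status="resolved"),
]

# Filter to active incidents, then sort so the most severe comes first.
active = filter_incidents(incidents, status="active")
by_severity = sorted(active, key=lambda incident: incident.severity)

# Updates added out of order still read chronologically on the timeline.
login = by_severity[0]
login.add_update(datetime(2024, 1, 5, 9, 30), "Status changed to stable")
login.add_update(datetime(2024, 1, 5, 9, 5), "Incident commander assigned")
```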
Complete visibility into incident impact
In distributed environments, where there are many potential points of failure, having full visibility into each part of your stack is crucial to quickly identifying the source of an issue. Datadog Incident Management unifies your incident response workflow with the rest of your monitoring platform, so that you can seamlessly pivot from an alert to relevant dashboards, then declare an incident and begin your investigation without losing any context or needing to switch tools. No matter where you are in the incident management process, you can drill down into your logs, traces, network traffic, infrastructure metrics, and more to troubleshoot and find the root cause.
Making the most of post-incident reviews
As important as it is to resolve an outage, it’s just as important to analyze the root cause and take steps to ensure it doesn’t happen again. Our incident response workflow has built-in tools for collaborative documentation so you can learn from the outages you’ve faced. Each incident has a remediation view, where you can create and track post-incident tasks as well as link postmortem documents and Datadog Notebooks.
Datadog Notebooks supports real-time collaborative editing, so your team can work together to document the incident response and investigation process, or write and share postmortems. You can add interactive graphs from any Datadog data source and easily scope them to the exact time range of the incident. Full support for classic Markdown also enables you to add rich context, like code snippets detailing how to resolve an issue. If the issue occurs again, you’ll have a full record of the steps you previously took to resolve it.
Incident Management works with Notebooks to automate postmortem creation. Once an incident is resolved, you can generate a full postmortem with the click of a button; Datadog will automatically create and populate a Notebook with the incident impact, root cause, timeline, and post-incident tasks. From here, you can continue to collaboratively edit, or graph and drill down into your data.
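As a rough picture of what such a generated postmortem contains, the sketch below assembles the four parts named above (impact, root cause, timeline, and post-incident tasks) into a Markdown document. The `build_postmortem` function, its section headings, and the sample data are assumptions made for illustration; Datadog's generated Notebook may be laid out differently.

```python
def build_postmortem(title, impact, root_cause, timeline, tasks):
    """Assemble a Markdown postmortem from incident details (illustrative layout)."""
    lines = [
        f"# Postmortem: {title}",
        "",
        "## Impact",
        impact,
        "",
        "## Root cause",
        root_cause,
        "",
        "## Timeline",
    ]
    # ISO 8601 timestamps sort chronologically as plain strings.
    lines += [f"- {when}: {note}" for when, note in sorted(timeline)]
    lines += ["", "## Post-incident tasks"]
    lines += [f"- [ ] {task}" for task in tasks]
    return "\n".join(lines)


postmortem = build_postmortem(
    title="Checkout latency",
    impact="Elevated checkout latency for 45 minutes.",
    root_cause="Connection pool exhausted after a config change.",
    timeline=[
        ("2024-01-05T09:45", "Incident resolved"),
        ("2024-01-05T09:00", "Incident declared"),
    ],
    tasks=["Add an alert on connection pool saturation"],
)
print(postmortem)
```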
Notebooks are also a powerful way to share mitigation and remediation steps in case similar or related incidents occur, helping you avoid repeating investigations you’ve already performed. You can create runbooks that provide established action items, instructions, and processes for responding to specific types of problems. You can also link to webhooks in your Notebook that initiate fixes or rollbacks, automating parts of your response workflow.
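On the receiving end, a rollback webhook is typically just an HTTP endpoint that accepts a small payload. The sketch below shows what a call to one might look like using only the Python standard library. The URL, the payload fields, and the `build_rollback_request` helper are all hypothetical; your deployment tooling defines the real endpoint and the schema it expects.

```python
import json
import urllib.request

# Hypothetical endpoint exposed by your own deployment tooling.
ROLLBACK_WEBHOOK_URL = "https://deploy.example.com/hooks/rollback"


def build_rollback_request(service, version, url=ROLLBACK_WEBHOOK_URL):
    """Build (but do not send) a POST request asking for a rollback."""
    payload = json.dumps({"service": service, "target_version": version}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_rollback_request("checkout", "v1.4.2")

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

Keeping the request construction separate from the send step makes it easy to review exactly what a runbook link would trigger before anyone clicks it.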
Get started today
Datadog’s incident management product provides a streamlined set of features for responding to outages that’s fully integrated into the monitoring platform you already use, letting you seamlessly pivot from your alerts and data to your incident response workflow and back again. We’re working on adding even more features, enhancements, and integrations in the future. If you’re a Datadog customer, you can try out the Incidents UI today, as well as the Datadog Slack App. If you’re new to Datadog, sign up for a 14-day free trial.
For more information on using Datadog Incident Management, reach out to sales@datadoghq.com or your sales representative.