Stay Up to Date on the Latest Incidents With Bits AI | Datadog

Author Jordan Obey
Author Kai Xin Tai
Author Maya Perry

Published: April 10, 2024

Since the release of ChatGPT, there’s been growing excitement about the potential of generative AI—a class of artificial intelligence trained on pre-existing datasets to generate text, images, videos, and other media—to transform global businesses. Last year, we released our own generative AI-powered DevOps copilot called Bits AI in private beta. Bits AI provides a conversational UI to explore observability data using natural language.

Today, Bits AI is generally available within Datadog Incident Management, our seamless, end-to-end offering that enables DevOps teams and SREs to quickly detect, investigate, and resolve service disruptions and other incidents. Generative AI is a common-sense fit for incident management because it can swiftly analyze data from several sources to flag, categorize, and prioritize incidents, making it easier to jump-start incident response.

Consider one of the most common and frustrating pain points DevOps teams and SREs face. When an incident that needs urgent attention suddenly emerges, teams go back and forth in Slack threads that are often dozens or even hundreds of messages long. New responders have to parse large volumes of messages to find relevant information about the current status of an incident and what actions have already been taken. This is in addition to the time it takes to surface similar and related incidents and gather any other key data that can aid mitigation efforts.

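In this post, we’ll look at how Bits AI for Incident Management solves this challenge by providing auto-generated incident summaries, answering natural language queries that surface key information, and helping you perform incident management tasks.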

Get auto-generated incident summaries

In Datadog, every declared incident has its own Details page with an Incident Timeline that logs the actions taken to resolve the incident. With Bits AI, you will automatically get a quick summary of everything in the Incident Timeline as soon as you join the incident’s dedicated Slack channel. Bits AI saves you from switching between applications and searching through timelines for relevant information so you can stay focused on troubleshooting. And for the best experience, we recommend that you enable Slack Mirroring, which pushes Slack messages to the Incident Timeline and allows Bits AI to include those messages in a summary. With mirroring enabled, instead of scrolling through your Slack messages to understand the state of an incident, Bits AI will tell you everything you need to know from the jump.

Bits AI incident summary
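To make the effect of Slack Mirroring concrete, here is a minimal sketch of which text would be available to a summary with mirroring on or off. The `TimelineEntry` type, its fields, and the `visible_to_summary` function are all hypothetical names invented for illustration; this is not Datadog's data model or API.

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    at: int       # seconds since the incident was declared (hypothetical field)
    source: str   # hypothetical labels: "datadog" or "slack"
    text: str


def visible_to_summary(timeline, slack_messages, mirroring_enabled):
    """Return, in time order, the text a summary could draw on.

    Assumption: without mirroring, Slack messages never reach the
    timeline, so they are not available to the summary.
    """
    entries = list(timeline)
    if mirroring_enabled:
        # Mirroring pushes channel messages onto the incident timeline.
        entries.extend(slack_messages)
    return [entry.text for entry in sorted(entries, key=lambda e: e.at)]


timeline = [
    TimelineEntry(0, "datadog", "Incident declared"),
    TimelineEntry(300, "datadog", "Severity set to SEV-2"),
]
slack = [TimelineEntry(120, "slack", "Rolling back the latest deploy")]

print(visible_to_summary(timeline, slack, mirroring_enabled=False))
print(visible_to_summary(timeline, slack, mirroring_enabled=True))
```

With mirroring disabled, the rollback message from Slack is missing from the summary input, which is why enabling the setting is recommended.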

Summaries generated by Bits AI keep you up to date and contextualize incidents around key details such as their nature, impact, contributing factors, and what steps have already been taken to resolve them. These summaries live in our Action Tray, which provides a list of shortcuts for actions you can take directly from Slack to speed up collaboration and remediation, such as paging on-call personnel, joining a Zoom call, and hopping onto CoScreen.

Having Bits AI automatically summarize incidents for new responders not only allows teams to quickly understand the context and scope of an incident, but it also prevents other team members from personally needing to update others on an incident’s status. Now, all hands can be focused solely on remediation, significantly reducing your mean time to resolution (MTTR).

Use natural language queries to surface key information

To aid their troubleshooting, DevOps engineers and SREs can request key incident information from Bits AI in natural language by starting a Slack message with @Datadog. With Bits AI, you can quickly get ad hoc incident summaries, ask about related incidents, and perform incident management tasks on the fly.

Read fresh incident summaries at any time

Interacting with Bits AI as an always-on assistant helps speed up problem-solving and saves you the time of hunting down relevant data yourself. For example, let’s say you joined an incident’s Slack channel and Bits AI provided you with a quick summary to get you up to speed. After some time has elapsed, you would like an updated snapshot of the incident. At any time, you can ask Bits AI to provide an ad hoc summary of a current incident by simply typing, “@Datadog, summarize incident 2091.”

Bits AI ties together alerts and related incidents based on the semantic similarities between alert tags and titles and any incident timeline events and discussions. By automatically contextualizing alerts with incident data, Bits AI is able to answer natural language prompts like:

  • “@Datadog, seems like dashboards are taking a long time to load. Is there an incident?”
  • “@Datadog, I am seeing alerts on increased lag on our checkout service. Is there a related incident?”

You can see Bits AI at work in the following screenshot. A high error alert including the tag service:events-intake was triggered, and when asked whether there are any related incidents, Bits AI searched for incidents on the events-intake service and returned the response shown below.

Bits AI automatically ties together related incidents
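As a rough illustration of relating an alert to incidents through its tags and title, the sketch below ranks incidents by simple set overlap (Jaccard similarity). This is a deliberately crude stand-in: the post describes semantic similarity, which word overlap does not capture, and the `Incident` type, the equal weighting, the threshold, and the sample incident numbers are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Incident:
    number: int
    title: str
    tags: set[str] = field(default_factory=set)


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for semantic similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_incidents(alert_title, alert_tags, incidents, threshold=0.2):
    """Rank incidents by combined tag and title overlap with an alert."""
    scored = []
    for incident in incidents:
        tag_score = jaccard(alert_tags, incident.tags)
        title_score = jaccard(tokens(alert_title), tokens(incident.title))
        score = 0.5 * tag_score + 0.5 * title_score  # assumed equal weights
        if score >= threshold:
            scored.append((score, incident))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in scored]


# Made-up incidents for illustration only.
incidents = [
    Incident(101, "High error rate on events-intake", {"service:events-intake"}),
    Incident(102, "Checkout latency increase", {"service:checkout"}),
]
matches = related_incidents(
    "High errors on events-intake",
    {"service:events-intake", "env:prod"},
    incidents,
)
print([incident.number for incident in matches])
```

Here the alert shares the `service:events-intake` tag and most title words with the first incident, so only that incident clears the threshold.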

When you are responding to an incident, you may want to see whether any similar incidents occurred in the past and what was done to resolve them. Asking Bits AI to surface similar incidents and potential solutions can significantly reduce your MTTR by enabling you to find key information across your entire incident history without needing to click through each past incident.

Bits AI responds to queries about related incidents

Additionally, as you put together incident reports, you can use Bits AI to search for incidents via queries such as, “@Datadog, how many incidents involved checkout failures in the last month?”

Bits AI responds to queries about past incidents

Get assistance performing incident management tasks

Finally, Bits AI can easily perform incident management tasks directly from Slack. For example, if you’d like Bits AI to update the severity of an incident, you can simply write, “@Datadog, update this incident to SEV-5.”

Bits AI performs incident management tasks

And if you are not in an incident’s channel, you can update its status from a different Slack channel by specifying its incident number: “@Datadog, mark incident 2091 as stable.”
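The two requests above carry slightly different information: one targets the incident tied to the current channel, and the other names an incident by number. The sketch below makes that difference explicit by turning each phrasing into a structured action. It uses fixed regular expressions purely for illustration; Bits AI interprets free-form language with generative AI, not patterns like these, and `IncidentAction` and `parse_request` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentAction:
    action: str                     # "set_severity" or "set_status"
    value: str                      # e.g. "SEV-5" or "stable"
    incident_number: Optional[int]  # None means "the incident for this channel"


# Either "this incident" or an explicit "incident <number>".
_TARGET = r"(?:this incident|incident (?P<number>\d+))"
_SEVERITY = re.compile(rf"update {_TARGET} to (?P<value>SEV-\d)", re.IGNORECASE)
_STATUS = re.compile(rf"mark {_TARGET} as (?P<value>\w+)", re.IGNORECASE)


def parse_request(message: str) -> Optional[IncidentAction]:
    """Map the two example phrasings to an action; return None otherwise."""
    if not message.startswith("@Datadog"):
        return None
    for action, pattern in (("set_severity", _SEVERITY), ("set_status", _STATUS)):
        match = pattern.search(message)
        if match:
            number = match.group("number")
            return IncidentAction(
                action, match.group("value"), int(number) if number else None
            )
    return None


print(parse_request("@Datadog, update this incident to SEV-5."))
print(parse_request("@Datadog, mark incident 2091 as stable."))
```

The first request leaves `incident_number` empty because the channel identifies the incident, while the second must include the number because it is sent from a different channel.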

Harness the power of generative AI to enhance incident management

Bits AI streamlines your troubleshooting by leveraging generative AI for real-time, easy-to-read insights into ongoing incidents. Bits AI for Datadog Incident Management is our first generally available generative AI feature and is part of a larger suite of incident management generative AI tools, such as AI-assisted automatic postmortem generation, which is currently in private beta.

To get started, head to Incident Settings and open the Slack integration. There, toggle on “Activate Bits AI features in incident Slack channels for your organization.” For the best Bits AI experience, we also highly recommend enabling Slack Mirroring via the “Push Slack channel messages to the incident timeline” toggle.

If you aren’t already using Datadog, sign up today for a 14-day free trial.