Business-critical infrastructure and services generate massive volumes of observability data from many disparate sources. It can be challenging to synthesize all this data to gain actionable insights for detecting and remediating issues—particularly in the heat of incident response. That’s why Datadog built Bits AI, a generative AI–powered DevOps copilot that can help you investigate and respond to incidents more efficiently across the Datadog web app, mobile app, and Slack, without switching contexts.
Bits AI provides a single, conversational interface that helps surface insights from throughout your environment by finding and correlating key data from across the Datadog platform, including Watchdog-detected log and trace anomalies, metrics, events, real-user transactions, Security Signals, and cloud costs. Bits AI can also help you resolve issues by suggesting automated code fixes, creating synthetic tests, and finding relevant Datadog workflows to trigger.
In this post, we’ll discuss how Bits AI can help you:
- Diagnose issues and determine their scope
- Investigate issues faster by surfacing key data
- Streamline incident response and remediation
- Prevent issues from reoccurring
When you detect an issue in your production environment, it can be challenging to quickly triage and investigate the problem. Bits AI enables you to use conversational language to quickly find answers to the questions you care about most, without switching tools or contexts.
Let’s say you’ve been paged in the middle of the night about a series of alerts firing for a service called event-processor. You can open the Datadog mobile app and ask Bits AI for an assessment of the issue before even getting out of bed to sign on.
To understand if you should escalate this into an incident, you want to check if any other dependencies are affected by the issue with event-processor. The following screenshot shows how you could instruct Bits AI to find these correlations for you, without having to specify which dependencies to check on. Bits AI can also surface key insights, such as faulty deployments from Deployment Tracking, log or trace anomalies from Watchdog, and Security Signals. In this case, Bits AI notifies you about several ongoing issues occurring in an upstream service called event-intake that correspond to the increased error rate for event-processor. You can also see that there’s already an ongoing incident for event-intake that may have spread to
At this point, you’ve confirmed that there’s a deep issue in your environment that demands your team’s attention. You can sign on to Slack and continue communicating with Bits AI directly in your team’s channels so that anyone else can easily jump in and join the conversation. You can also ask Bits AI to pull up assets like dashboards and internal documentation, including Confluence pages, so that your team can access helpful resources without having to manually look for them. For example, you can ask Bits AI, “Find me event-processor’s service health dashboard” or even “Find me dashboards about Kubernetes” to pull in infrastructure health and performance data that might indicate the root cause of
Once you’ve detected and diagnosed an issue, you’ll want to dive deeper into your observability data to find the root cause and figure out how to direct remediation. Natural language queries in Bits AI can help you investigate faster and reduce your MTTR by letting you use conversational prompts to discover relevant metrics, traces, and logs, as well as security, infrastructure, and cloud cost data, all from one place. Bits AI understands how your organization has tagged your services and infrastructure and can translate your prompts into the correct syntax for querying all your data. This makes it easier for anyone in your organization to gather key information, even if they don’t have deep knowledge of the services involved. You can access the Bits AI chat window from anywhere in the Datadog web app or submit natural language queries in a host of key Datadog products, including Log Management and APM.
For example, let’s say your event-processor service triggered an alert related to high average request latency. You can prompt Bits AI, “Show me traces from event-processor that are slower than 1 second.” Bits AI will report a list of the queried traces, so you can quickly drill down into a flame graph and figure out which spans contain the bottleneck.
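To make the translation step more concrete, here is a minimal sketch of how a structured reading of that prompt could be rendered as a tag-style search string. It is not how Bits AI works internally. The `TraceFilter` type, the `to_query` helper, and the exact form of the `@duration:>1s` term are assumptions made for illustration, so check the trace query syntax documentation before reusing the resulting string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceFilter:
    """Hypothetical structured form of a natural language trace request."""
    service: str
    min_duration_s: Optional[float] = None
    status: Optional[str] = None


def to_query(f: TraceFilter) -> str:
    """Render the filter as space-separated key:value terms (assumed syntax)."""
    terms = [f"service:{f.service}"]
    if f.min_duration_s is not None:
        # Assumed duration facet; the :g format drops a trailing ".0".
        terms.append(f"@duration:>{f.min_duration_s:g}s")
    if f.status is not None:
        terms.append(f"status:{f.status}")
    return " ".join(terms)


query = to_query(TraceFilter(service="event-processor", min_duration_s=1))
print(query)
```

Keeping the filter as a small data structure, rather than assembling strings directly, makes it easy to add or drop conditions as the conversation narrows the search.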
If traces show that requests to the upstream event-intake service are causing a large latency bottleneck, you can investigate the dependency’s potential role in the problem by asking further questions, such as “How many errors did event-intake have in the past three hours?” and “What was the average request latency for event-intake starting yesterday at 9am?” Bits AI will also recommend useful follow-up questions based on the information it has provided over the course of your conversation. For example, Bits AI might suggest asking about any other detected issues for the event-intake service, which could reveal ongoing Watchdog alerts that impact
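Both example questions carry a relative time range that has to be resolved into concrete timestamps before any query can run. The following is a minimal sketch of that resolution step using only the Python standard library. The helper names `past_hours` and `since_yesterday_at` are hypothetical, and the fixed `now` value is chosen only to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

Window = Tuple[datetime, datetime]


def past_hours(now: datetime, hours: int) -> Window:
    """Window covering the last `hours` hours up to `now`."""
    return now - timedelta(hours=hours), now


def since_yesterday_at(now: datetime, hour: int) -> Window:
    """Window from yesterday at `hour`:00 (in now's timezone) up to `now`."""
    start = (now - timedelta(days=1)).replace(
        hour=hour, minute=0, second=0, microsecond=0
    )
    return start, now


# Fixed reference time so the example is reproducible.
now = datetime(2024, 5, 14, 2, 30, tzinfo=timezone.utc)

errors_window = past_hours(now, 3)
latency_window = since_yesterday_at(now, 9)
print(errors_window[0].isoformat(), "->", errors_window[1].isoformat())
print(latency_window[0].isoformat(), "->", latency_window[1].isoformat())
```

Note that "the past three hours" crosses midnight in this example, which is why the window is computed with `timedelta` arithmetic rather than by editing the hour field.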
During incident response, it’s essential to efficiently track and manage the process so that stakeholders can quickly access the most current information and context. But the administrative overhead of this work can be tedious and resource-intensive. By leveraging generative AI, Datadog can now perform many of these important tasks for you, so you can focus more on tackling complex issues in your apps and infrastructure. You can ask Bits AI to help you:
- Declare an incident in Datadog Incident Management
- Notify on-call team members via PagerDuty
- Update the severity of an incident
- Provide incident summaries
Incidents often progress quickly, making it hard for everyone to stay in the loop. Bits AI integrates seamlessly into your incident response Slack channel, so you can easily arm your team with the details they need to identify problems, determine their scope, and begin root cause analysis. When new responders join the incident Slack channel, Bits AI will automatically provide them with a summary of everything that has happened in the channel. They can also request a new summary as needed, or configure Bits AI to routinely post a summary at a set cadence.
To help your incident responders carry out remediation, Bits AI can surface key assets, such as Confluence runbooks and training guides. Bits AI can also suggest Datadog workflows to help you automatically fix issues. For example, if you discover that the event-intake service has become unresponsive due to a DDoS attack, you can interact with Bits AI to kick off a workflow that will block the IPs that are flooding event-intake with requests.
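The workflow itself runs inside Datadog, but the decision behind it can be sketched simply: count requests per source IP over a window and select the ones above a threshold. The snippet below is an illustration under stated assumptions, not the workflow's actual logic. The `ips_to_block` helper, the threshold of 100 requests, and the sample addresses (taken from documentation-reserved ranges) are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def ips_to_block(source_ips: Iterable[str], max_requests: int) -> List[str]:
    """Return IPs seen more than `max_requests` times, busiest first."""
    counts = Counter(source_ips)
    return [ip for ip, n in counts.most_common() if n > max_requests]


# Hypothetical sample: one source IP per request observed in the window.
requests = (
    ["203.0.113.7"] * 500
    + ["203.0.113.9"] * 320
    + ["198.51.100.4"] * 12
    + ["198.51.100.8"] * 3
)

blocklist = ips_to_block(requests, max_requests=100)
print(blocklist)
```

In practice the threshold would depend on normal traffic for the service, so a fixed number like this is only a starting point.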
Bits AI doesn’t just help your incident responders investigate and remediate issues; it can also help your developers find and fix the code errors that led to them. For example, let’s say there’s an Error Tracking issue for a high volume of NoneType errors in a Python script run by the
Datadog points you to the line of code where the error originated and provides a clear explanation of the error. It also analyzes execution context gathered from APM (including variable names and other state information, as well as additional source code surrounding the error) to provide an AI-generated test case and a fix you can deploy in your IDE. This feature saves you the time it would take to manually reproduce the error and find a solution, so you can stay focused on addressing the more complicated problems in your application.
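As an illustration of the kind of error, fix, and test case involved, here is a small, self-contained Python sketch. It is not taken from the service discussed above or from Bits AI output. The `OWNERS` table and the `lookup_owner`, `owner_email_buggy`, and `owner_email` functions are hypothetical names invented for the example.

```python
from typing import Optional

# Hypothetical lookup table; "team-c" was registered without an owner.
OWNERS = {"team-a": {"email": "a@example.com"}, "team-c": None}


def lookup_owner(team: str) -> Optional[dict]:
    """Return the owner record, or None if the team is unknown or unowned."""
    return OWNERS.get(team)


def owner_email_buggy(team: str) -> str:
    # Raises TypeError ('NoneType' object is not subscriptable) with no owner.
    return lookup_owner(team)["email"]


def owner_email(team: str, default: str = "unassigned") -> str:
    """Fixed version: handle the missing-owner case explicitly."""
    owner = lookup_owner(team)
    if owner is None:
        return default
    return owner["email"]


def test_owner_email_handles_missing_owner() -> None:
    """Regression test that reproduces the failing input."""
    assert owner_email("team-c") == "unassigned"
    assert owner_email("team-a") == "a@example.com"


test_owner_email_handles_missing_owner()
```

The regression test pins down the input that triggered the error, so the same failure is caught before it reaches production again.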
Once an incident is resolved, Bits AI can help you write the first draft of the postmortem, based on your team’s conversation in the incident Slack channel and the timeline in Incident Management. The generated postmortem will include:
- A summary of the system state at the time of the incident
- The customer impact of the incident
- The remediation actions that were taken
Responders can then iterate on the generated postmortem before finalizing the document.
To further improve your team’s posture for future incidents, you can also leverage Bits AI to create Synthetic tests that check for the problems you found. By writing text prompts, you can easily spin up API tests that ping a single endpoint and browser tests that step through user actions. These tests can help proactively validate the availability of services and endpoints, as well as the performance of key user journeys. For example, you can ask Bits AI to “help me create a test on shopist.io to test its availability and check if a user can log in successfully.” Datadog will respond by creating a browser test that pings that URL and steps through the login process. You can further customize this test as desired, for instance, by adding assertions for the desired page load speeds.
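To show what an availability assertion amounts to, the following sketch evaluates a status code and a response time limit against a recorded result. It makes no network call and does not reflect how Datadog stores or runs Synthetic tests. The `CheckResult` type, the `evaluate` helper, and the 2000 ms limit are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """Hypothetical record of one request made by a test run."""
    url: str
    status_code: int
    response_ms: float


def evaluate(result: CheckResult, expected_status: int, max_ms: float) -> List[str]:
    """Return the failed assertions; an empty list means the check passed."""
    failures = []
    if result.status_code != expected_status:
        failures.append(f"status {result.status_code} != {expected_status}")
    if result.response_ms > max_ms:
        failures.append(f"response time {result.response_ms:g} ms > {max_ms:g} ms")
    return failures


ok = CheckResult("https://shopist.io", status_code=200, response_ms=340)
slow = CheckResult("https://shopist.io", status_code=200, response_ms=2600)

print(evaluate(ok, expected_status=200, max_ms=2000))
print(evaluate(slow, expected_status=200, max_ms=2000))
```

Returning a list of failures instead of a single boolean keeps each broken assertion visible, which is useful when a test checks both availability and speed.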
Bits AI can also automatically suggest these tests based on an analysis of your RUM performance. For example, you can ask, “What Synthetic tests should I create to cover the most popular user journeys in my app?” Bits AI will suggest tests that can help you proactively improve your user experience, and optionally create them for you. By expanding your testing footprint based on the insights your team gleaned from an incident, you can reduce the likelihood of future incidents and be better prepared for the next one.
With the power of generative AI, Datadog now enables you to use natural language prompts to surface intelligent insights from your observability data, generate key assets like tests and postmortems, and streamline your incident response and remediation. This new technology helps everyone in your organization better leverage your observability data and reduce context switching during investigations. By integrating Bits AI into your incident response workflows, you can empower disparate stakeholders to collaborate more effectively so they can resolve issues faster and help limit the scope of incidents—regardless of their level of experience or familiarity with your organization’s monitoring data.