
Kai Xin Tai
Getting paged pulls engineers away from meaningful work, and in many organizations, the on-call response process remains manual and draining. A monitor fires, and teams scramble to identify the root cause. The process is reactive, reliant on siloed knowledge concentrated in a few experts, and often lacks sufficient context. The rise of coding agents is only making this harder, as more code is shipped faster with less human oversight. That means greater production complexity and more alerts that are increasingly difficult to troubleshoot. Traditional incident response simply doesn’t scale in this new world.
Today, we're introducing Bits AI SRE, an autonomous AI teammate that investigates alerts, coordinates incidents, and learns from every response. Bits acts like an engineer on your team: it sees the same telemetry, understands the context of your systems, and can generate and test far more hypotheses in parallel than a human can. All of this happens automatically, with no prompting required. By the time you’ve opened your laptop, Bits has often already surfaced a likely root cause. If escalation is needed, Bits manages incident response end to end: posting status updates, flagging related alerts, recommending next steps, and supporting post-incident reviews.
In this post, we’ll cover how the Bits AI SRE agentic model of incident response helps you scale your team and enable your engineers to spend more time building and shipping great software.
Autonomous alert investigation
Let’s take a look at how Bits AI SRE works in practice. When an alert fires, Bits immediately launches an investigation. It begins by gathering context: reading the monitor message, checking linked Confluence runbooks, referencing past investigations of the same monitor, and running exploratory telemetry queries. By default, these findings are written to the Monitor status page and the Bits Investigations page. If configured, Bits can also post directly to Datadog On-Call, Slack, or your preferred ticketing system, such as Case Management, Jira, or ServiceNow, when you use the @oncall, @slack, or @case handles in the monitor definition. This ensures the right teams are notified through the right channels without added manual effort.
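To make the handle-based routing concrete, here is a minimal Python sketch that groups the handles found in a monitor message by destination. It is an illustration only, not how Bits or Datadog parses messages: the `routing_targets` function, the sample message, and the assumption that each handle takes a `prefix-name` form (for example, `@slack-checkout-alerts`) are all hypothetical, so check your own monitor's notification syntax before relying on it.

```python
import re

# The three prefixes come from the post; the "-name" suffix format and the
# sample names below are assumptions made for illustration.
HANDLE_PATTERN = re.compile(r"@(oncall|slack|case)-([\w-]+)")


def routing_targets(monitor_message: str) -> dict[str, list[str]]:
    """Group the notification handles in a monitor message by destination type."""
    targets: dict[str, list[str]] = {"oncall": [], "slack": [], "case": []}
    for kind, name in HANDLE_PATTERN.findall(monitor_message):
        targets[kind].append(name)
    return targets


# Hypothetical monitor message with one handle per destination.
message = (
    "Checkout error rate is above threshold. "
    "@oncall-payments @slack-checkout-alerts @case-ops"
)
print(routing_targets(message))
```

A quick check like this can help confirm that a monitor message names every channel you expect before the monitor goes live.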
Based on what it learns, Bits dynamically generates multiple root cause hypotheses and begins testing them by querying data across your environment and reasoning through the results. Like an engineer scanning dashboards, analyzing logs and traces, or checking recent changes and Watchdog alerts, Bits performs these tasks using purpose-built tools. At each step, it decides which tool to call and whether it needs more information.

It methodically invalidates hypotheses without supporting evidence and digs deeper into promising leads. Each hypothesis is classified as validated, invalidated, or inconclusive, enabling teams to quickly see what’s confirmed, what’s ruled out, and where further investigation is needed. What once took more than 30 minutes of manual triage now happens automatically, often resulting in a confident diagnosis before you've even opened a laptop.
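The three-way classification can be pictured with a small Python sketch. This is a toy model under stated assumptions, not Bits's actual reasoning: the `Hypothesis` fields, the evidence counts, and the rule in `classify` are hypothetical and exist only to show how validated, invalidated, and inconclusive outcomes might be separated and summarized.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VALIDATED = "validated"
    INVALIDATED = "invalidated"
    INCONCLUSIVE = "inconclusive"


@dataclass
class Hypothesis:
    description: str
    supporting: int      # evidence consistent with the hypothesis (hypothetical field)
    contradicting: int   # evidence against the hypothesis (hypothetical field)


def classify(h: Hypothesis) -> Status:
    """Toy rule: one-sided evidence decides; mixed or absent evidence is inconclusive."""
    if h.supporting > 0 and h.contradicting == 0:
        return Status.VALIDATED
    if h.supporting == 0 and h.contradicting > 0:
        return Status.INVALIDATED
    return Status.INCONCLUSIVE


def summarize(hypotheses: list[Hypothesis]) -> dict[str, list[str]]:
    """Group hypothesis descriptions by status so responders can scan the outcome."""
    summary: dict[str, list[str]] = {status.value: [] for status in Status}
    for h in hypotheses:
        summary[classify(h).value].append(h.description)
    return summary


# Hypothetical investigation results.
results = [
    Hypothesis("Recent deploy introduced a slow query", supporting=3, contradicting=0),
    Hypothesis("Upstream dependency outage", supporting=0, contradicting=2),
    Hypothesis("Node memory pressure", supporting=1, contradicting=1),
]
print(summarize(results))
```

Keeping the inconclusive bucket separate from the invalidated one matters in practice: it tells responders where evidence is missing rather than where a cause has been ruled out.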
Beyond speed, this shift is about autonomy and continuous improvement. As a reasoning agent, Bits doesn't start from scratch every time. It draws on memory from past alerts to recognize patterns and accelerate investigations. If the issue has occurred before, Bits remembers what worked and what didn’t. You can correct Bits when it makes a mistake or confirm its diagnosis when it gets it right. This means that every alert is a learning opportunity, making the next response faster, smarter, and better aligned with how your team works.

Bits also supports collaboration and transparency, delivering rich, structured insights directly into shared workflows. When Bits confidently determines a likely root cause, it posts it directly in the Slack thread. These messages include links to relevant data, making it easy for engineers to review, audit, or build on the analysis. These messages aren’t just status updates—they’re starting points for action and coordination. While this example focuses on Slack, Bits also integrates with Datadog On-Call and Case Management, supporting bidirectional syncing with tools like ServiceNow and Jira—ensuring seamless integration with your existing processes.

You can ask Bits follow-up questions just as you would a teammate, get next-step recommendations, or query service health metrics and ownership information. Importantly, Bits scales with your team and operational load. It works 24/7 and delivers consistent, high-quality analysis across all severities and times of day, whether it’s a major production outage at 4 a.m. or a low-priority anomaly during office hours.
Incident coordination
When alerts escalate into incidents, timely coordination is critical. Along with alert investigation, Bits helps teams stay on top of these high-stakes incidents.
Deliver clarity in chaos with real-time incident summaries and stakeholder updates
Responders who join mid-incident often have to parse through Slack channels with hundreds of messages to piece together what’s happened, what’s been attempted, and where things stand. This information overload creates delays, miscommunication, and longer time to resolution. Bits automatically generates real-time incident summaries with key details like nature, impact, contributing factors, and actions taken. You can also request an on-demand update at any time by messaging “@Datadog, summarize this incident.”
Within Datadog, teams can define custom message templates with dynamic AI-generated fields and then pair them with notification rules to automatically send updates via Slack, Microsoft Teams, email, Datadog On-Call, and other platforms. This ensures that key stakeholders like executives receive timely and relevant updates throughout the incident lifecycle without adding manual work to already busy teams. You can also ask Bits to draft a Datadog Status Page update to keep customers informed on the progress of the incident.
Proactive detection of related incidents
Recognizing related incidents is often the key to faster resolution. Bits automatically detects when new incidents are declared within 20 minutes of one another and proactively flags potential connections. This helps teams identify whether they’re dealing with a local issue or symptoms of a broader outage and avoid duplicate investigations.
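The 20-minute rule can be sketched in a few lines of Python. This is only an illustration of time-window matching, not Datadog's detection logic: the `related_pairs` function, the incident IDs, and the choice to treat exactly 20 minutes as inside the window are assumptions.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Window taken from the post; treating it as inclusive is an assumption.
RELATED_WINDOW = timedelta(minutes=20)


def related_pairs(incidents: dict[str, datetime]) -> list[tuple[str, str]]:
    """Return pairs of incident IDs declared within the window of one another."""
    # Sort by declaration time so the second incident in each pair is never earlier.
    ordered = sorted(incidents.items(), key=lambda item: item[1])
    return [
        (first_id, second_id)
        for (first_id, first_time), (second_id, second_time) in combinations(ordered, 2)
        if second_time - first_time <= RELATED_WINDOW
    ]


# Hypothetical incident declaration times.
declared = {
    "INC-1": datetime(2025, 1, 6, 9, 0),
    "INC-2": datetime(2025, 1, 6, 9, 12),
    "INC-3": datetime(2025, 1, 6, 10, 30),
}
print(related_pairs(declared))
```

Here `INC-1` and `INC-2` are flagged as a candidate pair because they were declared 12 minutes apart, while `INC-3` falls outside the window. Time proximity alone only suggests a connection; responders still decide whether the incidents share a cause.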

Capture follow-up tasks and generate a postmortem
Once an incident is resolved, Bits will automatically post a final summary visible to everyone in the channel—ensuring a shared understanding of how the issue was addressed. It also identifies any follow-up tasks mentioned during the incident and prompts users to review and formalize them. These tasks are saved directly in the incident’s Remediation tab in Datadog.

When it’s time to document the incident, Bits can help kick things off with a first draft of the incident postmortem that responders can refine and share for review. For organizations with custom reporting requirements, postmortem templates can be configured to include AI variables, such as customer impact, system context, and lessons learned. This reduces time spent compiling information so teams can focus on the deeper analysis that drives improvement. Lastly, when you review your operational burden as part of a weekly incident review, you can ask Bits to analyze trends with a question such as, "@Datadog, how many incidents involved checkout failures in the last month?"
Bits AI SRE reimagines the way you run operations
Bits AI SRE introduces a new way to operate production systems, where AI moves beyond simple chat prompts to drive complex root cause investigations and coordinate incident response. As a fully autonomous teammate, Bits understands your systems, analyzes telemetry in real time, applies structured reasoning, and learns from every interaction. It doesn’t just react—it thinks, adapts, and improves with every response.
Get started with Bits AI SRE today. If you don’t already have a Datadog account, sign up for a 14-day free trial.