A small SRE team protects Nulab's core platform with Datadog Bits AI SRE | Datadog
case study

About customer

Nulab is a collaboration software company behind tools like Backlog and Cacoo, supporting millions of users with scalable, cloud-based services across a rapidly growing, complex infrastructure.

Software
100+
Japan
“Bits AI SRE is not just automation. It fundamentally changes how we approach incident response and reliability as a team.”
Hisatomo Futahashi, Principal Engineer, Nulab

Why Datadog?

  • Enables autonomous investigations across logs, metrics, and traces
  • Accelerates incident triage with AI-driven root cause analysis
  • Surfaces deep insights without requiring prior system context
  • Supports natural language investigations across infrastructure and costs
  • Extends SRE capabilities to individuals and teams at any scale

Challenge

As Nulab’s systems scaled across services, teams, and cloud environments, maintaining reliability became increasingly complex. The small SRE team needed to investigate issues faster while reducing its growing cognitive load.

Key Results

  • >30 min → 4 min: Incident analysis time reduced
  • 1.7M+: Logs analyzed per minute at scale
  • High → Low: Operational burden reduced for a small SRE team
  • Reactive → Proactive: Transformed reliability posture

Sustaining reliability across a mature, enterprise-scale platform

Nulab has spent over 20 years building collaboration tools trusted by enterprise customers. Its two most prominent tools are Backlog, which serves 15,000+ paid organizations, and Cacoo, which supports 4M+ users.

At the core of Nulab’s ecosystem are two foundational platform services: Nulab Account, which handles unified authentication and billing across every Nulab product, and Nulab Pass, which enforces organization-wide security and access controls. Together they run on 50+ services, 1,500+ containers, 300+ hosts, and 35+ AWS accounts.

Despite this vast infrastructure, a small SRE team is responsible for the shared platform every Nulab service depends on. Hisatomo Futahashi, Principal Engineer at Nulab, handles monitoring, incident response, and platform health across the full stack. Given the team’s limited size relative to the platform’s scale, Nulab began looking for ways to multiply engineering capacity and extend what one person could do.

Accelerating investigations with AI-powered SRE

To address these challenges, Nulab adopted Bits AI SRE, and Futahashi integrated it into the company’s monitoring workflow. When alerts come in, engineers trigger Bits investigations from Slack. Bits became a force multiplier almost immediately, enabling faster and more consistent investigations.

During a real-world incident, a late-night DDoS attack was investigated and understood in just four minutes, even while processing massive volumes of telemetry. “Investigations often finish with just Bits now,” says Futahashi. “I don’t even open AWS Console or Terminal anymore.”

Bits analyzes logs, traces, and metrics together, mirroring how experienced SREs troubleshoot systems. It identifies patterns, surfaces anomalies, and provides clear conclusions without requiring deep prior context. This allows Nulab’s small SRE team to move from manual triage to guided, AI-assisted investigations, reducing cognitive load.

Expanding SRE beyond incidents to everyday workflows

For a small team, every manual task carries real weight. Beyond critical incidents, Nulab uses Bits to handle the operational work that would otherwise consume an engineer’s day.

Many low-priority alerts that previously took 30 minutes or more on average to investigate can now be triaged with Bits in under 5 minutes. Latency investigations automatically incorporate historical context, helping catch recurring issues before they become regressions. “Bits gives us deep, precise insights without prior context,” says Futahashi. “Knowledge that used to exist only in engineers’ heads is now surfaced automatically.”

Engineers can now investigate logs and costs by chatting with Bits in natural language, with no complex query writing and no custom dashboards. Work that once required deep specialist knowledge can now be offloaded to Bits, extending the team’s reach without adding headcount.

Building a new model for human and AI collaboration

For Nulab, adopting Bits AI SRE marks a fundamental shift in how incident response is approached. “Bits protects the ‘now’, I protect the ‘future’,” says Futahashi.

By offloading real-time investigation work to AI, the team can focus on improving systems, refining processes, and driving long-term reliability. At the same time, Bits continuously learns from data, context, and usage, creating a feedback loop that improves reliability over time.

Nulab treats Bits as a member of the team, investing in better telemetry, stronger context, and best practices to maximize its effectiveness.

Turning AI-driven SRE into a competitive advantage

Today, Nulab has transformed incident response from a reactive, manual process into a faster, more scalable, and more intelligent workflow. Investigations that once required deep expertise and significant time can now be completed in minutes. Engineers operate with greater confidence, reduced cognitive load, and improved visibility across complex systems. “Bits AI SRE is not just automation. It fundamentally changes how we approach incident response and reliability as a team,” says Futahashi.

By combining Datadog observability with AI-driven SRE, Nulab is building a more resilient platform while enabling teams to move faster and focus on what matters most: delivering reliable, high-quality experiences to its users.

Resources

press release

Datadog Unveils Latest AI Agents to Rapidly Resolve Application Issues

blog

Introducing Bits AI SRE, your AI on-call teammate