Baz is building agents for the autonomous codebase with Datadog LLM Observability

Building the operating layer for the autonomous codebase

For Baz, observability is not just for debugging incidents. It is how the team evaluates agents, identifies failure modes, validates guardrails, and improves autonomous behavior over time. Unlike traditional static analysis tools or surface-level AI harnesses, Baz focuses on behavior-aware analysis that understands how code actually runs and whether it is safe to ship.

The platform processes around one million LLM-powered operations every day across code review, issue detection, and automated fixes, all running on Amazon Web Services (AWS). These agents are not just generating suggestions. They are actively shaping production code, which raises the bar for correctness, reliability, and trust. “Our goal is to make our autonomous agents safe to run in production,” says Nimrod Kor, CTO and Co-founder of Baz. “That means every decision has to be transparent, explainable, and grounded in real context.”

To support these requirements, Baz built observability into the core of its platform. The team correlates LLM decisions with runtime traces, logs, and source code, enriching traces with Git metadata so every agent action can be mapped to a specific commit and execution path.
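As a rough illustration of that enrichment step, the sketch below builds a small tag dictionary that could be attached to a trace span. It is not Baz's implementation. The helper name `git_span_tags` is hypothetical, and the environment variable names and tag keys are assumptions about how the commit information reaches the running process.

```python
import os


def git_span_tags(env=None):
    """Build tags that tie a span to the commit that produced the running code.

    Assumption: the deploy pipeline exports the commit SHA and repository URL
    as environment variables. Names and tag keys here are illustrative.
    """
    env = os.environ if env is None else env
    tags = {}
    sha = env.get("DD_GIT_COMMIT_SHA", "").strip()
    repo = env.get("DD_GIT_REPOSITORY_URL", "").strip()
    # Only emit a tag when the value is actually present, so that a missing
    # variable never shows up as an empty, misleading tag.
    if sha:
        tags["git.commit.sha"] = sha
    if repo:
        tags["git.repository_url"] = repo
    return tags
```

Tags built this way would be attached to each agent span, so a reviewer can filter traces by commit when a behavior change follows a deploy.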

The visibility gap in agent-driven systems

Before adopting Datadog LLM Observability, Baz had strong telemetry across logs and traces, but no unified way to understand how agents reasoned through decisions. When an agent produced an incorrect recommendation, engineers could not easily determine whether the issue came from the model, the input, or a downstream system. “We had the data, but it was fragmented,” says Kor. “There was no single place where we could connect LLM inputs and outputs with traces, logs, and the code that actually ran.”

This lack of visibility made it difficult to confidently scale autonomous workflows. Debugging required stitching together multiple signals, which slowed iteration and increased the cost of deploying agent-driven automation.

Unifying LLM Observability with application context

With Datadog LLM Observability, Baz created a single view of agent behavior across its entire stack. “We can see the full decision surface of every agent,” Kor explains. “Inputs, outputs, annotations, traces, and code metadata are all connected, so we can immediately understand what happened.”

Baz also uses LLM Observability annotations and the annotation queue to bring human-in-the-loop validation directly into its workflows. Engineers can review, label, and evaluate agent outputs in context, creating a continuous feedback loop that improves agent performance over time. “With LLM Observability annotations, we reduced false positive recommendations and increased safe automatic fixer triggers,” says Kor.

By connecting LLM Observability with Application Performance Monitoring (APM) and Log Management, Baz can trace an issue from an agent decision through every layer of execution without switching tools. With APM, engineers can follow an agent’s output into downstream services, quickly identifying latency issues, failed dependencies, or misbehaving code paths tied to a specific trace and commit. With logs, they can inspect the exact inputs, outputs, and system events surrounding that decision, adding critical context to understand why it happened. This unified view eliminates the need to manually stitch together signals across systems. Engineers can move directly from a problematic output to root cause, reducing investigation time and enabling faster, more confident fixes.

In one case, an agent generated a code change that appeared to reference unknown data, raising concerns about a potential cross-customer issue. By inspecting the LLM trace and annotations, the team discovered the input was empty and the model had generated a hallucinated response. “We could see exactly what the model received and why it behaved that way,” says Kor. “That allowed us to add input validation and eliminate that entire class of failures.”
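The article does not describe what Baz's input validation looks like. A minimal sketch of the general idea, with hypothetical names (`EmptyAgentInputError`, `validate_agent_input`), is to reject empty input before any model call is made:

```python
class EmptyAgentInputError(ValueError):
    """Raised when an agent would be invoked with no usable input."""


def validate_agent_input(text, min_chars=1):
    """Return the stripped input, or raise if there is nothing to analyze.

    Assumption: the agent input is a single string such as a diff or a file
    body. A real check would likely validate structure as well as length.
    """
    if text is None:
        raise EmptyAgentInputError("agent input is missing")
    stripped = text.strip()
    if len(stripped) < min_chars:
        raise EmptyAgentInputError("agent input is empty")
    return stripped
```

Failing fast here means the model is never asked to respond to nothing, which is the condition that produced the hallucinated change in this case.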

In another case, the team encountered a production issue where an agent entered a degenerate output loop, generating long and broken responses. LLM Observability revealed that the issue correlated with a model downgrade, missing token limits, and a lack of separation between reasoning and output. “We were able to trace the issue to specific configuration changes and runtime conditions,” Kor explains. “We reverted the model, added guardrails, and introduced a hidden reasoning field to stabilize outputs.”
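One possible reading of those guardrails, sketched under stated assumptions: the model is asked for a JSON reply with separate `reasoning` and `output` fields, only `output` is passed on, and the output is checked for size and for repetition. The field names, the character cap, and the repetition heuristic are all hypothetical. In practice a token limit would also be set on the model call itself, and the character cap below is only a stand-in for that.

```python
import json
from collections import Counter

MAX_OUTPUT_CHARS = 4000  # hypothetical cap, standing in for a real token limit


def looks_degenerate(text, window=20, max_repeats=5):
    """Heuristic: True if any fixed-size chunk repeats more than max_repeats times."""
    chunks = [text[i:i + window] for i in range(0, len(text) - window + 1, window)]
    if not chunks:
        return False
    _, count = Counter(chunks).most_common(1)[0]
    return count > max_repeats


def parse_agent_reply(raw):
    """Return only the user-facing output from a structured model reply.

    The "reasoning" field stays in the raw reply for inspection but is never
    returned, so it cannot leak into the final output.
    """
    data = json.loads(raw)
    output = data["output"]
    if len(output) > MAX_OUTPUT_CHARS:
        raise ValueError("output exceeds length cap")
    if looks_degenerate(output):
        raise ValueError("output looks like a repetition loop")
    return output
```

The repetition check is deliberately crude. It only catches loops whose repeated text lines up with the chunk boundaries often enough to cross the threshold, so it complements a hard limit rather than replacing one.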

Driving faster iteration and reliable automation at scale

Today, Baz operates with end-to-end LLM trace coverage across its platform. Every request can be followed from user interaction through agent decision, with runtime traces, logs, and code context unified in a single system. “Datadog gives us a single place to see model decisions, runtime behavior, and code context together,” says Kor. “That is what makes it possible to run autonomous agents with confidence.”

This changed how Baz ships. Instead of treating agent failures as isolated bugs, the team can turn them into reusable safety cases: known scenarios that inform prompts, validation rules, rollout guardrails, and pre-deployment testing. Observability became part of Baz’s deployment discipline, not just its debugging workflow.

The same visibility shortens day-to-day engineering work. Engineers can go from an unexpected output to root cause in a single workflow, reducing investigation time and allowing more findings to be resolved during on-call shifts. The team also captures representative LLM traces to build datasets of real-world scenarios, using them to test and validate changes before deployment. By running experiments against known failure cases, Baz checks that past regressions stay resolved and ships more reliable agents from the start. “Developers can get fully onboarded in less than two days and immediately understand how their features behave in production,” Kor notes.
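The idea of replaying known failure cases before deployment can be sketched as a small regression harness. This is an illustration, not Baz's tooling: the `Scenario` shape, the `must_not_contain` check, and the stub agent are all hypothetical, and real scenarios would be built from captured traces rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    """One captured failure case: an input plus text the agent must not emit."""
    name: str
    agent_input: str
    must_not_contain: str


def run_regressions(agent: Callable[[str], str], scenarios: List[Scenario]) -> List[str]:
    """Run every scenario through the agent and return the names that fail."""
    failures = []
    for scenario in scenarios:
        output = agent(scenario.agent_input)
        if scenario.must_not_contain in output:
            failures.append(scenario.name)
    return failures


def stub_agent(agent_input: str) -> str:
    """Stand-in for a real agent call, used only to exercise the harness."""
    return "no changes" if not agent_input.strip() else "patch for: " + agent_input


scenarios = [
    Scenario("empty-input", "", must_not_contain="patch for"),
    Scenario("normal-diff", "fix typo", must_not_contain="unknown_table"),
]
failed = run_regressions(stub_agent, scenarios)
```

An empty `failed` list would gate the rollout. Any name in it points back to the scenario, and therefore the trace, that the change broke.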

Beyond LLM Observability, Baz uses Datadog Application Performance Monitoring, Log Management, and Continuous Profiler to monitor system behavior, while dashboards and monitors provide real-time visibility into platform health. Real User Monitoring (RUM) helps the team detect frontend issues and connect user interactions to downstream agent activity, enabling end-to-end visibility from user sessions to agent decisions and backend services.

Custom tagging on LLM spans allows Baz to track usage and cost at a granular level, including per-organization analysis. This helps the team identify which customers or workflows drive inefficient usage, detect abusive or bot-driven activity, and improve margins. “We can attribute cost and behavior down to the individual workflow,” says Kor. “That helps us operate efficiently while scaling.”
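To make the per-organization attribution concrete, the sketch below rolls up token usage by an organization tag. The span layout, the `org_id` tag key, and the flat price per thousand tokens are hypothetical simplifications. Real pricing usually differs by model and by input versus output tokens.

```python
from collections import defaultdict


def cost_by_org(spans, usd_per_1k_tokens):
    """Sum estimated cost per organization from tagged LLM spans.

    Assumption: each span is a dict with "input_tokens", "output_tokens",
    and a "tags" dict that carries an "org_id" key.
    """
    totals = defaultdict(float)
    for span in spans:
        tokens = span["input_tokens"] + span["output_tokens"]
        totals[span["tags"]["org_id"]] += tokens / 1000 * usd_per_1k_tokens
    return dict(totals)


spans = [
    {"input_tokens": 500, "output_tokens": 500, "tags": {"org_id": "acme"}},
    {"input_tokens": 1500, "output_tokens": 500, "tags": {"org_id": "acme"}},
    {"input_tokens": 1000, "output_tokens": 0, "tags": {"org_id": "globex"}},
]
totals = cost_by_org(spans, usd_per_1k_tokens=2.0)
```

Adding a workflow tag to the grouping key would give the per-workflow view Kor describes.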

Custom dashboards and metrics have reduced the time required to evaluate product changes by up to 80%, eliminating hours of manual analysis and allowing the team to move faster with confidence.

Baz is using observability to build the autonomous software lifecycle: observe agent behavior in production, identify failure patterns, evaluate changes against real scenarios, and deploy safer automation back into the codebase. For Baz, observability is not the end of the workflow. It is the control loop that makes autonomous software systems possible. “We are building towards a world where every coding decision is observable and actionable,” says Kor. “That is what allows us to scale AI safely.”