
Rashel Hoover

Will Potts
LLM applications rarely crash. They degrade quietly. Once these applications are shipped to production, subtle quality failures become harder to catch with traditional signals. Tone shifts, hallucinated details, off-topic responses, and incomplete reasoning can emerge while latency and token usage look stable.
To help you review and improve LLM quality at scale, Datadog LLM Observability now includes Automations and Annotation Queues. Automations route production traces to datasets or annotation queues based on configurable rules and sampling strategies. Annotation Queues provide a structured environment for systematic human review of curated traces. Domain experts can apply structured labels and qualitative feedback while viewing the full trace context, including spans, metadata, and evaluation results. Together, these features support a quality improvement workflow that includes issue detection, trace routing and review, and model refinement.
In this post, we’ll show how you can use LLM Observability to:
- Route production traces to datasets and annotation queues automatically
- Review LLM traces in context and apply consistent labels
- Use annotations to fuel a quality improvement loop
Route production traces to datasets and annotation queues automatically
Manual quality workflows do not scale. Teams often browse traces by hand, export examples to spreadsheets, and share screenshots in Slack. That process loses context, produces inconsistent labeling, and creates static datasets that drift away from production behavior.
Automations help you keep evaluation and review workflows tied to real traffic. You can define rules that detect traces that need attention and route them automatically to a dataset for evaluation or to an annotation queue for human review. Configurable sampling rates control how many traces are routed, helping you focus on high-signal requests without overwhelming reviewers.

Automations can route traces based on signals such as:
- Evaluation failures or low evaluation scores (for example, score < 0.5)
- User feedback, such as a thumbs-down rating or negative sentiment (see the sketch after this list)
- Error status and retries
- Latency thresholds
- Sampling rates (for example, capture 10% of matching traces)
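If your application is already instrumented with the ddtrace Python SDK for LLM Observability, you can attach signals like user feedback to traces as evaluations, which an Automation rule can then match on. The following is a minimal sketch, assuming the SDK's `LLMObs.export_span` and `LLMObs.submit_evaluation` APIs; the helper names (`answer_question`, `record_user_feedback`, `call_model`) and the `user_feedback` label are illustrative, not part of the product.

```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

# Enable LLM Observability; the app name here is illustrative.
LLMObs.enable(ml_app="support-assistant")

@workflow
def answer_question(question: str) -> tuple[str, dict]:
    answer = call_model(question)  # hypothetical helper that calls your LLM
    LLMObs.annotate(input_data=question, output_data=answer)
    # Export the active span's context so feedback can reference this trace later.
    span_context = LLMObs.export_span()
    return answer, span_context

def record_user_feedback(span_context: dict, thumbs_up: bool) -> None:
    # Attach the user's rating to the trace as a categorical evaluation.
    # An Automation rule matching a negative user_feedback evaluation can then
    # route the trace to a dataset or an annotation queue.
    LLMObs.submit_evaluation(
        span_context=span_context,
        label="user_feedback",
        metric_type="categorical",
        value="positive" if thumbs_up else "negative",
    )
```

In this pattern, the application submits the signal and the routing logic stays in the Automation rule, so you can tighten or loosen the criteria without redeploying code.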
With routing in place, your datasets can automatically evolve with production traffic, and your queues stay populated with the traces that matter most.
Review LLM traces in context and apply consistent labels
Once traces are routed, Annotation Queues give your teams a purpose-built workspace for systematic human review inside Datadog. Reviewers can evaluate traces with full context rather than judging a stand-alone completion in a spreadsheet. In Review mode, domain experts can inspect each trace alongside the details that influence quality, including prompts and completions, spans, metadata, tool calls, inputs and outputs, and evaluation results.

Annotation Queues enable you to create shared labeling schemas so that reviewers evaluate traces against the same criteria and produce feedback that can be reliably compared and analyzed. Reviewers can apply:
- Pass/fail ratings
- Categorical labels with predefined values
- Numeric scores with defined ranges
- Boolean flags for specific attributes
- Free-form notes for nuanced feedback
As reviewers label traces with the same schema, patterns become easier to spot and reuse across investigations. Teams commonly converge on failure modes such as irrelevant responses, hallucinated references, tone mismatch, and overly verbose answers.
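To make the value of a shared schema concrete, here is a purely illustrative sketch of what consistently labeled annotations might look like once exported for analysis, and how a fixed set of categorical failure modes makes patterns straightforward to tally. The record shape and field names are assumptions for the example, not Datadog's export format.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative record shape only; not Datadog's export format.
@dataclass
class TraceAnnotation:
    trace_id: str
    passed: bool                # pass/fail rating
    failure_mode: str | None    # categorical label with predefined values
    quality_score: float        # numeric score in a defined range, e.g., 0-1
    hallucinated: bool          # boolean flag for a specific attribute
    notes: str = ""             # free-form reviewer notes

annotations = [
    TraceAnnotation("t-1", False, "hallucinated_reference", 0.3, True, "Cites a doc that does not exist"),
    TraceAnnotation("t-2", False, "tone_mismatch", 0.5, False, "Too casual for a billing dispute"),
    TraceAnnotation("t-3", True, None, 0.9, False),
]

# Because every reviewer uses the same categorical values,
# failure modes can be counted and compared directly.
failure_counts = Counter(a.failure_mode for a in annotations if not a.passed)
print(failure_counts.most_common())
```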
Use annotations to fuel a quality improvement loop
After traces are consistently labeled, the resulting data becomes actionable across your entire quality workflow. Automations and Annotation Queues help you use that data as part of a repeatable improvement loop to:
- Calibrate automated evaluators: Human annotations are the ground truth that makes LLM-as-judge evaluations trustworthy. You can use labeled traces to measure whether your automated evaluators are flagging the right things and to correct them when they’re off target (see the sketch after this list). A focused set of 100–200 annotated traces is often enough to meaningfully improve evaluator accuracy across all future production traffic.
- Build and maintain golden datasets: Annotated traces from production are more valuable than synthetic examples because they reflect real user behavior, edge cases, and failure modes that your team has already reviewed and understood. You can export labeled traces to seed test datasets and golden datasets that stay aligned with how your application behaves.
- Compare model versions and prompt iterations: When you ship a new prompt or upgrade your model, annotated baselines give you a reference point to measure against. Instead of asking whether things got better in the abstract, you can ask whether specific failure patterns decreased.
- Track failure patterns over time: Consistently labeled failure modes become measurable signals. Teams can track how these patterns change across model versions, user segments, and feature releases, turning qualitative observations into metrics that guide ongoing improvements.
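As one example of closing the loop, the sketch below shows how you might measure agreement between an LLM-as-judge evaluator and human pass/fail annotations once both exist for the same traces. The record fields and sample data are assumptions for illustration; the point is the comparison, not a particular export format.

```python
def evaluator_agreement(records: list[dict]) -> dict:
    """Each record holds 'human_pass' and 'judge_pass' booleans for one trace."""
    total = len(records)
    agreement = sum(r["human_pass"] == r["judge_pass"] for r in records) / total
    # Judge passes a trace that humans failed: quality issues the evaluator misses.
    missed_failures = sum(r["judge_pass"] and not r["human_pass"] for r in records) / total
    # Judge fails a trace that humans passed: noise that wastes reviewer time.
    false_alarms = sum(not r["judge_pass"] and r["human_pass"] for r in records) / total
    return {
        "agreement": agreement,
        "missed_failures": missed_failures,
        "false_alarms": false_alarms,
    }

# A handful of hand-written records (illustrative only).
sample = [
    {"human_pass": True,  "judge_pass": True},
    {"human_pass": False, "judge_pass": True},   # evaluator missed a failure
    {"human_pass": False, "judge_pass": False},
    {"human_pass": True,  "judge_pass": False},  # evaluator flagged a good answer
]
print(evaluator_agreement(sample))
```

Tracking these rates as you adjust an evaluator's prompt or model shows whether each change actually moves it closer to human judgment.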

Start using Automations and Annotation Queues to improve LLM quality
Automations and Annotation Queues enable you to scale online LLM evaluations across live production traffic while keeping humans in the loop where judgment matters most. You can route problematic traces to domain experts for systematic review, validate evaluations with human ground truth, and build golden datasets from production examples without losing critical context. This balance of automation and expert oversight helps you improve LLM quality as your applications evolve. To learn more, check out the Annotation Queues documentation.
If you don’t already have a Datadog account, you can sign up for a 14-day free trial to get started.





