Track, compare, and optimize your LLM prompts with Datadog LLM Observability

Jacob Simpher
Yahya Mouman
Barry Eom
Will Potts
LLM agents and LLM applications depend on continuous experimentation to refine their performance and accuracy. But many teams lack a systematic way to manage and observe these changes. Prompts are often hardcoded within applications and tracked through ad hoc methods, creating a disconnect between prompt iterations and key observability data. As a result, it’s difficult to know whether a change to your prompt improved your application or caused regressions in latency, cost, or evaluation results.

Prompt Tracking, a new feature that is generally available in Datadog LLM Observability, helps fill this visibility gap. Prompts are defined, versioned, and monitored as first-class artifacts, so you can manage and evaluate them with the same rigor that you apply to application code and models. You can correlate performance regressions or improvements directly with specific prompt changes; measure error rate, token usage, and latency; and make confident rollout decisions based on real data.

In this post, we’ll show how Prompt Tracking helps you define and observe changes in prompts and correlate those changes with performance.

Define and observe changes in prompts

Prompt Tracking introduces versioned, observable prompts within LLM Observability. Each prompt is defined with structured metadata, such as its name, version, template roles (for example, system, user, and assistant), and dynamic variables (for example, ${user_name} and ${ticket_id}). If your application uses the LangChain framework, prompt IDs and versions are automatically instrumented through the LLM Observability SDK.
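To make that structure concrete, here is a minimal, SDK-agnostic sketch in Python of what a versioned prompt with role-tagged messages and ${...} variables might look like; the field names and values are illustrative, not the SDK's own schema.

```python
from string import Template

# Illustrative only: a versioned prompt defined as structured data, with
# role-tagged template messages and ${...} placeholders for dynamic variables.
SUPPORT_PROMPT = {
    "name": "support-ticket-summary",  # hypothetical prompt name
    "version": "1.2",                  # hypothetical version label
    "template": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "Summarize ticket ${ticket_id} for ${user_name}."},
    ],
}

def render(prompt: dict, **variables: str) -> list[dict]:
    """Substitute ${...} variables into each message of the prompt template."""
    return [
        {"role": m["role"], "content": Template(m["content"]).substitute(variables)}
        for m in prompt["template"]
    ]

messages = render(SUPPORT_PROMPT, ticket_id="TICKET-4821", user_name="Dana")
```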

You can manage prompt definitions directly through the LLM Observability SDK and API. Python support is available now, with Node.js and Java support coming soon. Prompts are stored and versioned across environments, giving you a clear record of which version is deployed in dev, staging, and production.

When a tracked prompt is called, LLM Observability automatically attaches the corresponding version metadata to your LLM spans and traces. This metadata association gives you an immediate, auditable link between each prompt change, its deployment, and the observed performance. As a result, you can see how every iteration affects application outcomes. This versioned approach brings prompt development closer to established software practices, giving your engineers a feedback loop as tight as a CI/CD pipeline for traditional code.
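As a rough sketch of what this can look like with the Python SDK (the ddtrace LLMObs module): LLMObs.enable() and the annotation context manager are part of the SDK, but the exact Prompt fields shown below (id, version, variables) are assumptions based on the behavior described in this post; see the Prompt Tracking documentation for the authoritative API.

```python
# Sketch only: attaching prompt version metadata to LLM spans with the
# ddtrace LLM Observability SDK. The Prompt fields used here (id, version,
# variables) are assumptions; check the Prompt Tracking docs for the exact schema.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.utils import Prompt

LLMObs.enable(ml_app="support-bot")  # ml_app name is illustrative

with LLMObs.annotation_context(
    prompt=Prompt(
        id="support-ticket-summary",  # assumed field: the tracked prompt's name/ID
        version="1.2",                # assumed field: the prompt version label
        variables={"ticket_id": "TICKET-4821", "user_name": "Dana"},
    ),
):
    # Any auto-instrumented LLM call made inside this context (for example,
    # through the OpenAI or LangChain integrations) carries the prompt
    # metadata on its span, so traces can be filtered by prompt name and version.
    response = call_llm("Summarize ticket TICKET-4821 for Dana.")  # hypothetical helper around your LLM client
```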

A trace view showing span details for the OpenAI.createResponse step, including the prompt template, variables, and associated metadata.

Correlate prompt changes with performance

When prompts are versioned and instrumented, LLM Observability correlates prompt evolution with LLM performance metrics. In the Traces view, you can filter by prompt name and version to see how different iterations behave across requests. Each trace displays the full prompt template, the prompt variables, and associated telemetry data, including latency, token count, and evaluation metrics.

For example, if your team releases a refinement prompt (v2.0) to shorten responses, you can track how that change affects latency, errors, and token usage compared to the previous version (v1.2). If costs decrease but error rate increases significantly, you’ll know why. At that point, you can decide whether to refine the prompt or roll back to a previous version.

A prompt trend panel that shows prompt version calls in addition to average token usage and P95 latency for each version.

You can also replay a version of a prompt in the Prompt Playground to validate changes before production rollout. This environment offers quick, side-by-side testing of prompt behavior and model responses, shortening the iteration cycle between design and deployment of prompts.

The Prompts section of the Overview tab consolidates all of your tracked prompts into a single analytics-driven dashboard. Each entry shows a prompt’s versions, usage trends, and key performance indicators such as latency, token cost, and evaluation results. You can view timeseries trends by environment, model, or LLM provider to understand how prompts are performing across real workloads.

The Prompts section within the LLM Observability Overview tab, showing recent prompt updates and corresponding operational metrics such as calls, latency, and token use.

From this view, your platform and AI engineering teams can identify underperforming prompts, detect regressions before they reach users, and prioritize optimization work. For example, a surge in latency for one version might indicate inefficient template changes, whereas higher token counts could signal prompt bloat. Because Prompt Tracking is integrated with Datadog’s unified telemetry data (metrics, logs, traces, and evaluations), you can investigate performance regressions for prompts in context with infrastructure health or external API behavior.

Start tracking and optimizing your LLM prompts today

LLM Observability offers your teams full visibility into how changes in prompts influence the performance, reliability, and cost of their LLM applications. With clear versioning, unified observability, and performance correlation, Prompt Tracking enables data-driven iteration, eliminates guesswork, and reduces the risk of regressions. To learn more, check out the Prompt Tracking documentation.

If you don’t already have a Datadog account, you can sign up for a free trial to get started.

Related Articles

Gain end-to-end visibility into MCP clients with Datadog LLM Observability

Define, run, and scale custom LLM-as-a-judge evaluations in Datadog

Monitor Claude Code adoption in your organization with Datadog’s AI Agents Console

Gain visibility into Strands Agents workflows with Datadog LLM Observability
