Create and monitor LLM experiments with Datadog

Tom Sobolik

Shri Subramanian

Charles Jacquet
Your LLM application needs a comprehensive testing and evaluation framework so you can effectively optimize it during pre-production. By running experiments, you can optimize prompts, fine-tune temperature and other key parameters, test complex agent architectures, and understand how your application may respond to atypical, complex, or adversarial inputs. However, it can be difficult to manage your experiment runs and aggregate the results for meaningful analysis. And without effective governance of your experiment datasets, it’s hard to be confident that your applications are being tested appropriately.

Datadog LLM Observability’s new Experiments feature enables you to trace experiment runs, curate and manage datasets, and analyze evaluations and telemetry across multiple experiments. And with the Experiments SDK, you can build, run, and automatically instrument experiments for monitoring. In this post, we’ll show how Experiments helps you build, analyze, and manage your pre-production LLM testing—all from the same unified environment within LLM Observability.

Build datasets for experimentation

Your experiments are only as good as the data you feed them. You need clear visibility into the test data used for your experiments so you can ensure it's high quality and correctly annotated; otherwise, you risk misleading evaluations. Creating test datasets can also be tedious and error-prone—for instance, if you want to sample production data for your test datasets, you might find yourself meticulously copying and pasting LLM inputs and outputs from logs or traces into spreadsheets.

Datadog LLM Experiments solves this by letting you import data from your production traces in LLM Observability with a single click—or programmatically import or instantiate test datasets by using the LLM Experiments SDK. LLM Experiments also supports version control for datasets, so you can pull and push datasets to Datadog to manage them in the UI and easily share them across your team.
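
To give a sense of what building a dataset in code might look like, here's a minimal sketch of structuring records before uploading them with the SDK. The field names used here (input, expected_output, metadata) and the local snapshot step are illustrative assumptions rather than a required schema; see the Experiments SDK documentation for the actual dataset API.

```python
# A minimal sketch of structuring experiment dataset records in code. The field names
# (input, expected_output, metadata) are illustrative assumptions, not a schema
# required by the Experiments SDK.
import json

records = [
    {
        "input": "How much should I keep in an emergency fund?",
        "expected_output": "Roughly three to six months of essential expenses.",
        "metadata": {"topic": "savings", "source": "production-trace"},
    },
    {
        "input": "What's the difference between a Roth and a traditional IRA?",
        "expected_output": "Roth contributions are taxed up front; traditional contributions are taxed at withdrawal.",
        "metadata": {"topic": "retirement", "source": "manual"},
    },
]

# Keep a local, versioned snapshot of the records you plan to push as a new dataset version.
with open("personal_finance_qa_v1.json", "w") as f:
    json.dump(records, f, indent=2)
```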

The Datasets view enables you to create, view, and manage your experiments’ datasets from the Datadog web UI. The dataset page shows all records, including each record’s input, output, and metadata. You can also manually add new records to the dataset from this view. For example, the following screenshot shows records from a dataset containing question-answer pairs for a personal finance application.

datasets-editor

If you spot an issue with a record, you can click into it and edit any of its fields directly in the web UI. To avoid overwriting your changes, collaborators can clone their own versions of the dataset and work on them separately. All previous versions of the dataset are stored, so you can troubleshoot problems with an experiment that used an older version of the dataset.

Monitor experiment runs and form insights

Datadog LLM Experiments offers a complete set of tools for creating and monitoring experiments and test datasets. The Experiments SDK lets you write and run experiments that are automatically traced by Datadog and annotated with test dataset records for monitoring with LLM Observability. Experiments consist of tasks and evaluators. Tasks define the business logic of experiments (determining how your application will run on the provided data), while evaluators are used to score and compare all the outputs produced by tasks. Tasks can be as simple as a single LLM call, or as complex as a multi-agent workflow—Datadog ingests and tracks all subtasks for monitoring via traces.
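
To make that split concrete, here's a minimal sketch of the task/evaluator pattern in plain Python. The function shapes and the naive keyword-overlap scoring are illustrative assumptions; with the Experiments SDK you would register the task and evaluator rather than wiring them together by hand.

```python
# Sketch of the task/evaluator pattern: a task turns a dataset record's input into an
# output, and an evaluator scores that output (often against the record's expected value).
# call_llm is a hypothetical stand-in for your application's model client.

def call_llm(prompt: str) -> str:
    return "stubbed model response"  # placeholder for a real model call

def answer_task(record: dict) -> str:
    """Task: the business logic the experiment exercises, here a single LLM call."""
    return call_llm(f"Answer this personal finance question:\n{record['input']}")

def correctness_evaluator(record: dict, output: str) -> float:
    """Evaluator: scores the task's output, here with a naive keyword overlap."""
    expected = set(record["expected_output"].lower().split())
    produced = set(output.lower().split())
    return len(expected & produced) / max(len(expected), 1)

sample_record = {
    "input": "How much should I keep in an emergency fund?",
    "expected_output": "three to six months of essential expenses",
}
score = correctness_evaluator(sample_record, answer_task(sample_record))
```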

For instance, let's say you're experimenting with a RAG application. You could write a task that contains an LLM prompt and uses the RAG pipeline to retrieve additional context for the system prompt. You could then create evaluators to compute faithfulness, contextual precision, and contextual recall scores for the application's RAG retriever. Finally, you can run the experiment, which evaluates the application's output based on the three scores you've defined.
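
Building on the sketch above, the RAG setup might look something like the following: the task returns both the generated answer and the retrieved context, so each evaluator can score a different aspect of retrieval. The retriever, model call, and scoring bodies here are simplified placeholders, not production-grade metrics or the SDK's documented interface.

```python
# Sketch of a RAG-style task with multiple evaluators. retrieve_documents and call_llm
# are hypothetical stand-ins for your retriever and model client, and the scoring logic
# is intentionally simplified.

def retrieve_documents(question: str) -> list[str]:
    return ["stubbed context passage"]  # placeholder retriever

def call_llm(prompt: str) -> str:
    return "stubbed model response"  # placeholder model call

def rag_task(record: dict) -> dict:
    """Task: retrieve context, build the prompt, and generate an answer."""
    context = retrieve_documents(record["input"])
    joined_context = "\n".join(context)
    answer = call_llm(f"Context:\n{joined_context}\n\nQuestion: {record['input']}")
    return {"answer": answer, "context": context}

def faithfulness(record: dict, output: dict) -> float:
    """Is the answer grounded in the retrieved context? (simplified check)"""
    return float(any(passage in output["answer"] for passage in output["context"]))

def contextual_precision(record: dict, output: dict) -> float:
    """Was the retrieved context relevant to the question? (simplified check)"""
    return float(any(record["input"].lower() in p.lower() for p in output["context"]))

def contextual_recall(record: dict, output: dict) -> float:
    """Did retrieval surface what the expected answer needs? (simplified check)"""
    return float(record["expected_output"].lower() in " ".join(output["context"]).lower())

evaluators = [faithfulness, contextual_precision, contextual_recall]
```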

As soon as you run an experiment, traces of the run and details about the experiment automatically become available in Experiments. You can use Experiments to continually monitor evaluations and characterize application performance, compare this data across multiple experiments, and troubleshoot issues with your experiments using traces.

Analyze experiment results to find optimizations

The Experiment Details page gathers telemetry about all of an experiment's runs, including duration, errors, and evaluation scores. To understand how an experiment performed across all runs, you can view averages of its evaluations and performance metrics in the Summary section. Then, you can use the Evaluation Distribution to home in on the subset of experiment runs whose evaluations or metrics fall within a problematic threshold. This can help you find opportunities for optimization.

compare-experiments-histo-3

For example, the preceding screenshot shows Evaluation Distribution filters that highlight all experiment records with a high token count (indicating verbose and potentially costly inputs and outputs) and a suboptimal correctness_query score. By applying these filters, you can view a list of interesting records, examine their inputs and outputs, and then drill into traces to investigate potential root causes of the highlighted performance issues.

You can inspect each of these records’ traces in the run side panel. This includes tool and task executions and each individual LLM call used to produce the final output. Continuing our previous example, the following screenshot shows a prompt execution trace for a run with an unusually high token count and low correctness_query score.

experiment-trace

By using the trace tab to view each action taken by the application in an experiment run, you can more easily troubleshoot and interpret the application’s behavior and find ways to improve the output. In this example, we’re looking at the system prompt to see whether it included the correct context.

Compare models to find your application’s best fit

Datadog LLM Experiments enables you to aggregate and track experiments across multiple models so you can determine which one best suits your application's tasks. The main Experiments view lets you filter and sort your experiments to quickly surface instances of poor evaluation scores, high duration, and other issues. As with the Experiment Details page, you can also use the Distribution Filter to home in on a subset of experiments with evaluations and performance metrics within certain thresholds.

experiment-filters-2

You can then select a group of experiments and compare their results to evaluate application performance across different prompt versions, models, application code versions, and more. For example, let's say we want to test the conversational performance of a personal finance AI agent and have run two experiments, one using Claude Sonnet 4 and one using GPT-4.1. The following screenshot shows a comparison of these runs.

compare-experiments

We can quickly glean that the GPT-4.1 agent outperformed the Sonnet 4 agent on the accuracy_tool and correctness_query evaluations. We can also use this view to investigate inputs and outputs for individual experiment runs and spot outliers. For instance, we might find that the Claude-based application significantly outperformed GPT-4.1 on questions that invoke a specific task. We could then investigate further to either optimize that task for Claude or accept the performance tradeoff of running it with GPT-4.1.
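
One straightforward way to set up a comparison like this is to parameterize the task by model and run one experiment per candidate, naming each run after the model it used. The following is a hypothetical sketch: the call_llm signature, the model identifiers, and the naming convention are illustrative assumptions, and the actual run invocation comes from the Experiments SDK.

```python
# Hypothetical sketch: bind the same task logic to different models so each experiment
# differs only by the model under test. call_llm and the model identifiers are
# illustrative assumptions.

def call_llm(prompt: str, model: str) -> str:
    return f"stubbed response from {model}"  # placeholder model call

def make_task(model: str):
    """Return a task bound to a specific model."""
    def task(record: dict) -> str:
        return call_llm(f"Answer this question:\n{record['input']}", model=model)
    return task

# One experiment per candidate model; naming each run after its model makes the
# experiments easy to filter and compare in the Experiments view.
candidate_models = ["claude-sonnet-4", "gpt-4.1"]
tasks = {f"finance-agent-{model}": make_task(model) for model in candidate_models}
```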

Investigate model outputs to evaluate task performance

LLM Experiments gives you granular visibility into the output of every LLM call in your experiments. When you want to dive beyond evaluation scores and performance metrics to investigate how your application performs at a given task, you can examine generated outputs within experiments’ dataset records in the Compare page. This helps you understand LLM performance even in cases where it’s difficult to create effective evaluation metrics. For instance, let’s say you’re testing different models on an application designed to summarize news articles. You might see that both models are producing summaries containing the same key information, but using different phrasing and writing styles. By comparing outputs in the experiments’ dataset records, you can manually determine which model produces more natural-sounding, clear, and clean copy.

compare-outputs

Get comprehensive visibility into your LLM experiments

LLM experimentation helps you refine model parameters, evaluate features, and better understand how your application might behave when confronted with different forms of input. With Datadog LLM Experiments, you can trace experiment runs, audit datasets, and analyze results across multiple experiments so you can more effectively operate and manage pre-production testing for your LLM application.

Experiments is currently available in a limited preview. To sign up, see the corresponding page in our Product Preview site. And for more information about functional and operational LLM application performance monitoring in production with LLM Observability, see the LLM Observability documentation.

If you're brand new to Datadog, sign up for a free trial.
