How we made a SQL query optimization agent 59% more accurate using autoresearch and Agent Observability

Charles Jacquet

Product Manager

Zhengda Lu

Software Engineer

Thomas Sobolik

Senior Technical Content Writer

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

The Datadog team responsible for building and maintaining Database Monitoring (DBM) needed to tackle these challenges in order to explore whether an AI agent could augment DBM’s automated query optimization recommendations. The DBM team used Karpathy’s autoresearch tool to trigger 23 autonomous experiments that brought the query optimization recommendation agent from precision scores of P=0.54 to P=0.86 overnight. Through this iterative process, the team proceeded through three phases:

Optimizing the prompt and tool chain
Rightsizing the model for an appropriate cost-performance tradeoff
Breaking the LLM call into two separate passes to break through a final performance barrier

In this post, we’ll discuss the autoresearch-powered experimentation process in depth, exploring how the team planned and executed rapid iteration of the agent by using Agent Observability Experiments to track, analyze, and act on the experiment results.

Augmenting DBM’s query optimization recommender with agentic AI

DBM’s query optimization recommender currently uses a multi-source heuristic engine (written in Go) that combines SQL parse-tree analysis, real explain plans, schema metadata, and runtime metrics to detect optimization opportunities. It covers six pattern families:

Missing index detection with plan-flip analysis (detects when the planner alternates between strategies)
SELECT * expansion with schema-aware column enumeration
ORDER BY without LIMIT with metrics-based row-count thresholds
OFFSET without ORDER BY (pagination correctness)
Idle-in-transaction detection via activity event analysis
Comprehensive SQL rewrite rules (OR to ANY, CAST normalization, date-to-range, CTE filter pushdown, and more)

This engine is precise by design. Each pattern is validated against actual database context. Explain plans use relative cost filtering. Metrics-based scoring avoids false positives on small result sets. On our evaluation dataset, it achieves a precision score P=0.903.

The DBM team wanted to see if an AI agent could run after the heuristic engine to discover additional optimization patterns. They hypothesized that an agent could discover types of patterns that are harder to encode as individual heuristic rules because they require cross-referencing multiple signals or understanding subtle semantic tradeoffs. For example:

A sequential scan on an indexed column might mean stale statistics (needs ANALYZE), not a missing index.
A covering index exists but is not being used as an index-only scan, suggesting a stale visibility map (needs VACUUM).
An expensive aggregation query running 15,000 times could benefit from a materialized view.

These kinds of rules require reasoning that combines schema knowledge, plan analysis, and performance judgment. They are difficult to express as individual rules but could be more natural for an AI agent that can see all the signals together.

The team began testing this hypothesis by feeding an LLM a set of queries with a simple zero-shot prompt: no domain rules, just “analyze this SQL.” It surfaced many more patterns (a recall score R=0.90), but nearly half the suggestions were wrong (P=0.54).

System	Precision	Recall
Heuristic engine	0.903	0.633
AI agent (zero-shot)	0.543	0.898

In other words, the heuristic engine was more precise at finding valid optimizations, but the LLM could find a broader set of potential optimizations. In order for the agentic solution to be practical, the team had to figure out if they could teach the agent to be more precise while preserving this greater breadth. Next, we’ll discuss how the team answered this question by creating a rigorous evaluation dataset and an experiment infrastructure that enables fast iteration.

Building the experiment

In this section, we’ll discuss how the team created the data, evaluators, and experiment infrastructure they used to iterate their SQL optimization agent.

The dataset

To build the evaluation dataset, the team created 100 cases across five types: rewrites, missing indexes, anti-patterns, maintenance, and schema changes. Each case includes the SQL query and the telemetry the agent would see in production: schema, explain plans, metrics, and transaction stats. Of these, 30% are negative cases (queries that need no optimization).

The DBM team created these test cases programmatically using the Agent Observability SDK, as shown in the following code snippet:

from ddtrace.llmobs import LLMObs

LLMObs.enable(site="datadoghq.com", api_key="...", project_name="query-optimization")

records = [
    {
        "input": {
            "sql": "SELECT id, user_id FROM sessions WHERE status = 'expired'",
            "telemetry": {
                "schema": {"tables": {"sessions": {
                    "columns": [{"name": "id", ...}, {"name": "status", ...}],
                    "indexes": [{"name": "idx_status", "definition": "CREATE INDEX ... (status)"}]
                }}},
                "events": {"explain_plans": [{
                    "definition": {"Plan": {"Node Type": "Seq Scan", "Total Cost": 35000,
                                            "Plan Rows": 100, "Rows Removed by Filter": 99900}}
                }]},
            },
        },
        "expected_output": {
            "optimizations": [{"type": "Maintenance", "match_key": "maintenance:analyze:sessions"}]
        },
        "metadata": {"case_id": "E08", "category": "plan_analysis", "difficulty": "hard"},
    },
    # ... 99 more cases across 11 optimization types
]

dataset = LLMObs.create_dataset(dataset_name="pg-optimization-v1", records=records)

The evaluators

Once the dataset was in place, the team configured evaluators to measure the agent’s performance: precision, recall, and F1 scores. F1 summarizes the precision-recall tradeoff of each agent iteration in a single number, while the separate precision and recall scores show which side of the tradeoff moved between experiments. The following screenshot shows how these evaluators are displayed for each experiment run in Agent Observability Experiments.

Screenshot of the LLM Observability Experiments list view showing 20 of 23 experiment runs. Each row displays a status badge, experiment name, Judge F1, Judge Precision, Judge Recall scores, dataset name, time since last run, and the experimenter. Experiment names include “blind-haiku-twopass,” “twopass-cross-model-verify,” “twopass-surgical-verifier,” and others. The top result, “blind-haiku-twopass,” shows the highest F1 of 0.803 with precision 0.860 and recall 0.823.
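To make the three scores concrete, here is a minimal sketch of per-case scoring by exact match-key overlap. It is a simplification: the team's evaluators are LLM judges, and the scoring conventions for empty sets below are assumptions, as is the match key index:create:sessions.

```python
def score_case(expected_keys, predicted_keys):
    """Return (precision, recall, f1) for one case using exact match-key overlap."""
    expected, predicted = set(expected_keys), set(predicted_keys)
    if not expected and not predicted:
        # Negative case handled correctly: nothing to find, nothing suggested.
        return 1.0, 1.0, 1.0
    true_positives = len(expected & predicted)
    # Assumed convention: an empty denominator scores 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One correct suggestion plus one spurious (hypothetical) suggestion.
p, r, f1 = score_case(
    ["maintenance:analyze:sessions"],
    ["maintenance:analyze:sessions", "index:create:sessions"],
)
print(p, r, round(f1, 3))
```

In this example the spurious suggestion halves precision while recall stays at 1.0, which is the same shape of failure the zero-shot agent showed at the dataset level.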

The autoresearch infrastructure

Karpathy’s autoresearch is a setup where you give an AI agent a small but real LLM training codebase and let it experiment autonomously overnight. The agent modifies train.py, trains for five minutes, checks if the result improved, keeps it or discards it, and repeats. You wake up in the morning to a log of experiments and a better model.

The design is deliberately simple:

One GPU
One file the agent edits (train.py)
One metric (validation bits per byte)
One file the human edits (program.md, the instructions that define the research direction).

The key idea is that humans are not designing individual experiments. A human sets the research direction by writing program.md, and the agent does the rest: proposing changes, running experiments, evaluating results, and deciding what to try next. At five minutes per run, the agent completes about 12 experiments per hour, or roughly 100 overnight.

While autoresearch is designed to optimize model training, the DBM team wanted to apply the same methodology to AI agent development, where the “weights” being tuned are prompts, skills, and tools rather than neural network parameters. The DBM team adapted Karpathy’s tool to iterate the SQL optimization agent; 23 experiments produced 17 kept improvements.

In this case, the configured evaluators form the objective function that the autoresearch agent loop optimizes against. First, the team set a concrete target for this function: P>=0.85, R>=0.85 on a small model. Then, they set a fixed time budget of 15 minutes for each experiment run. Finally, they defined the intended agent behavior in a HANDOFF.md document. This document defines the current state, the error analysis, and the next hypotheses. A coding agent running in the autoresearch environment reads the handoff, designs experiments, runs them via Agent Observability Experiments, analyzes per-case failures, and writes the updated handoff for the next session.
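The keep-or-discard loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the team's driver: the Result type, research_loop, and run_experiment are hypothetical names, and using F1 as the keep criterion is an assumption the post does not confirm. Only the two target values come from the text.

```python
from dataclasses import dataclass

TARGET_PRECISION = 0.85  # targets stated in the post
TARGET_RECALL = 0.85


@dataclass
class Result:
    name: str
    precision: float
    recall: float
    f1: float


def research_loop(baseline, candidates, run_experiment):
    """Keep a candidate only if it beats the best F1 so far; stop once the target is met."""
    best, log = baseline, []
    for candidate in candidates:
        result = run_experiment(candidate)  # hypothetical: runs one experiment, returns a Result
        kept = result.f1 > best.f1          # assumed keep criterion
        log.append((result.name, "kept" if kept else "discarded"))
        if kept:
            best = result
        if best.precision >= TARGET_PRECISION and best.recall >= TARGET_RECALL:
            break
    return best, log
```

The returned log plays the role of the experiment history: every attempt is recorded with its outcome, including the discarded ones.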

Experiment code for one of these autoresearch runs is shown in the following snippet:

# Assumes LLMObs and dataset from the previous snippet; run_optimization and the
# judge_* evaluators are defined elsewhere (not shown).
def optimization_task(input_data, config=None):
    """Your agent, wrapped as an experiment task."""
    config = config or {}  # guard against the default config=None
    return run_optimization(
        sql=input_data["sql"],
        telemetry=input_data["telemetry"],
        model=config.get("model", "anthropic/claude-haiku-4-5"),
    )

experiment = LLMObs.experiment(
    name="haiku-self-verify",
    task=optimization_task,
    dataset=dataset,
    evaluators=[judge_precision, judge_recall, judge_f1],
    # Hypothesis metadata recorded alongside the results
    config={
        "model": "claude-haiku-4-5",
        "prompt_version": "v20h",
        "phase": "distillation",
        "goal": "Add self-verification step to boost precision",
        "expectation": "+2pp P from double-checking before output"
    },
    description="Self-verification: model reviews each suggestion against evidence before including it.",
)

# Run the task over the dataset with 10 parallel jobs
result = experiment.run(jobs=10)

Each experiment is tagged with the hypothesis (goal), the prediction (expectation), and the research phase. Agent Observability Experiments records all of this as structured metadata alongside the per-case results and agent traces. When the automated driver analyzes failures in the next iteration, this metadata is what it reads to decide what to try next.

Running the experiment

The experiment ran in three phases. The first two ran eight experiments each: first, optimizing the agent’s system prompt, tool descriptions, and worked examples on a large model, and then finding the best way to compress the result to a smaller model while still meeting the evaluation targets. These two phases produced a result just beneath the target precision score of 0.85. A third phase ran seven more experiments to implement a two-pass solution that finally reached the team’s precision target. In this section, we’ll discuss how the agent was iterated through each of these phases.

Phase 1: Prompt and tool iteration on a large model

In Phase 1, the autoresearch loop ran eight experiments on Claude Sonnet 4.6, starting with a zero-shot prompt (P=0.543, R=0.898) and iterating across three levers: the system prompt, the tool descriptions, and the worked examples.

The agent used seven tools that mirror production telemetry APIs: get_table_schema, get_explain_plans, get_query_metrics, get_idle_in_transaction_stats, and others. Early experiments focused on how the prompt instructs the agent to use these tools and interpret their output.

These runs produced three key turning points:

Structured output and evidence rules pushed precision from 0.54 to 0.83 across the first few experiments. Requiring the agent to cite tool evidence (explain plan costs, schema indexes) before suggesting optimizations eliminated most hallucinations.
Relaxing rules regressed. One experiment loosened missing-index co-occurrence rules, hoping to recover recall. Both precision and recall dipped.
Worked examples broke through. Adding three examples of what not to optimize (high-selectivity scans, subqueries with OFFSET, stale statistics) pushed precision to 0.878 while holding recall at 0.858.

After these iterations, blind evaluation on 50 more unseen cases confirmed no overfitting: P=0.870, R=0.830, as shown in the following screenshot:

Screenshot of the LLM Observability Experiments Compare view showing a side-by-side comparison of two experiments: “sonnet-zero-shot” (baseline) and “sonnet-worked-examples” (variant). A results table shows the variant reduced average duration from 22.5s to 15.7s (38% faster), improved F1 from 0.55 to 0.776 (41.2% increase), improved precision from 0.543 to 0.878 (61.4% increase), and slightly reduced recall from 0.898 to 0.858 (4.5% decrease), across 108 experiment runs.

This view in Agent Observability Experiments lets you compare two experiments side by side. Here, we compare the initial zero-shot starting point against the final result of Phase 1. The precision gain is clear: the Phase 1 version’s precision is about 61% higher than the zero-shot baseline’s (0.878 versus 0.543).

The team could also review full traces of this experiment run within Agent Observability’s trace visualization. In the following screenshot, we can see how the agent called resolve_sql, get_explain_plans, get_query_metrics, and get_table_schema before producing its recommendation.

Screenshot of an Agent Observability trace view for experiment “sonnet-worked-examples.” The left panel shows a LangGraph workflow timeline with sequential tool calls: resolve_sql (304ms), get_explain_plans (52ms), get_query_metrics (1.67ms), and get_table_schema (1.13ms), followed by a model call to langchain_anthropic.chat_models.ChatAnthropic (9.73s). The right panel shows the agent’s reasoning—identifying that an UPDATE statement with no WHERE clause would lock every row and generate excessive WAL bloat—and its recommendation to add a WHERE clause or process in batches of ~1,000 rows with a short sleep between iterations.

Phase 2: Compressing to a small model

Claude Sonnet 4.6 worked well for Phase 1, but at three times the cost of Haiku 4.5 ($3 input/$15 output per MTok versus $1 input/$5 output per MTok), it made sense to see if the quality gains could be compressed to the smaller model. The autoresearch driver explored two approaches for this.
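The cost gap is easy to check with a little arithmetic. The sketch below uses the per-MTok prices quoted above; the token counts per case are hypothetical placeholders, and run_cost is an illustrative helper rather than part of any SDK.

```python
# Prices in USD per million tokens (MTok), as quoted in the post.
PRICES = {
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "haiku-4.5": {"input": 1.00, "output": 5.00},
}


def run_cost(model, input_tokens, output_tokens):
    """Cost in USD of one agent run for the given token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical token counts for a single evaluation case.
sonnet = run_cost("sonnet-4.6", input_tokens=8_000, output_tokens=1_000)
haiku = run_cost("haiku-4.5", input_tokens=8_000, output_tokens=1_000)
print(f"Sonnet ${sonnet:.4f} vs Haiku ${haiku:.4f} ({sonnet / haiku:.1f}x)")
```

Because both the input and output prices differ by the same factor, the ratio stays at three regardless of the token mix, as long as both models consume the same number of tokens.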

First, it tried to directly transfer the Sonnet prompt to Haiku and find optimizations. Iterations that streamlined the prompt and added more worked examples failed to make up the response quality deficit introduced by running the original prompt on Haiku instead of Sonnet.

Applying a more rigorous, knowledge distillation–style approach broke through the challenge. The agent compared Sonnet and Haiku traces on the same cases in Agent Observability. In cases where Haiku got the wrong answer, the agent could directly compare with Sonnet for the same input and see exactly how it reasoned: which tools it called, what evidence it weighed, and how it arrived at the correct optimization type. The traces revealed that Haiku was confusing missing indexes with stale statistics and schema changes. The agent extracted four examples from Sonnet’s correct reasoning and added them to Haiku’s prompt. Both precision and recall improved.
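The first step of that comparison, finding cases worth distilling, can be sketched as a simple filter: keep the cases where the large model matched the label and the small one did not. The function name, the input shapes (case ID mapped to a list of match keys), and the key index:create:sessions are assumptions for illustration; in practice the team worked from traces in Agent Observability.

```python
def distillation_candidates(expected, large_outputs, small_outputs):
    """Return case IDs where the large model matched the label and the small one did not."""
    candidates = []
    for case_id, label in expected.items():
        large_ok = set(large_outputs.get(case_id, [])) == set(label)
        small_ok = set(small_outputs.get(case_id, [])) == set(label)
        if large_ok and not small_ok:
            candidates.append(case_id)
    return candidates


expected = {"E08": ["maintenance:analyze:sessions"]}
sonnet_out = {"E08": ["maintenance:analyze:sessions"]}
haiku_out = {"E08": ["index:create:sessions"]}  # hypothetical: missing index confused with stale stats

print(distillation_candidates(expected, sonnet_out, haiku_out))
```

Each case ID this returns points at a pair of traces to read side by side, which is where the worked examples for the smaller model's prompt would come from.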

The loop also experimented with on-demand skills: reusable instructions the agent can invoke for specific tasks like evidence gathering for missing index recommendations. Combining all hypotheses (distilled examples, skills, tool call hints) at once was unstable, but selective combinations worked better. The best single-pass Haiku version used distilled examples plus a self-verification step.

After another blind evaluation on 50 unseen cases, the agent confirmed that the new Haiku prompts generalize. Filtering by model name in Agent Observability surfaces just the Haiku experiments, making it easy to track progress within a single model family. The following screenshot shows the results of this test: P=0.837, R=0.823.

Screenshot of the LLM Observability Experiments view filtered by the search term “haiku,” comparing 11 experiments across 3 fields. A timeline chart plots judge_f1 (blue), judge_precision (pink/orange), and judge_recall (yellow) scores over time. The chart shows generally upward trends for precision and F1 across the session, with recall staying relatively stable, ending with precision and recall both near 0.8.

Breaking through the single-pass ceiling

These results were strong, but just shy of the P=0.85 target the team had set. However, the autoresearch driver couldn’t find a way to improve them any further while sticking to a single Haiku call. The driver proposed splitting the problem into two passes: a high-recall detector followed by a surgical verifier.

The first iteration of the verifier was too aggressive and significantly reduced recall (P=0.921, R=0.588). The second was too soft to bring precision above the bar. The third struck the best balance by checking only five specific false-positive patterns identified through per-case error analysis.
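The shape of the two-pass design can be sketched as a broad detector followed by a narrow filter. In the real system both passes are LLM calls; here they are replaced with a stub detector and a single hand-written rule so the control flow is visible. All names, the case layout, and the rule itself are hypothetical.

```python
def two_pass(case, detect, verify_rules):
    """Pass 1 proposes broadly; pass 2 drops only suggestions that hit a known false-positive rule."""
    suggestions = detect(case)
    return [s for s in suggestions if not any(rule(case, s) for rule in verify_rules)]


def index_already_exists(case, suggestion):
    """Hypothetical rule: reject a missing-index suggestion when the table already has an index."""
    return suggestion["type"] == "Missing index" and bool(case["indexes"])


def stub_detector(case):
    """Stand-in for the high-recall first pass."""
    return [{"type": "Missing index"}, {"type": "Maintenance"}]


case = {"indexes": ["idx_status"]}
kept = two_pass(case, stub_detector, [index_already_exists])
print(kept)
```

Keeping the verifier to a short list of targeted rules means it can only remove suggestions that match a known failure pattern, which is why a surgical verifier costs less recall than a general second opinion.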

The autoresearch agent also tested cross-model verification (Sonnet as a verifier for Haiku) and distillation to GPT-5.4 nano, but the Haiku-only two-pass approach worked best. A final blind evaluation on 50 unseen cases produced P=0.860, R=0.823, F1=0.803, as shown in the following screenshot.

Screenshot of the LLM Observability Experiments Compare view contrasting “blind-haiku-single-pass” (baseline) against “blind-haiku-twopass” (variant) across 50 unseen cases. The two-pass variant increased average duration from 11.9s to 19.1s (60.5% longer), improved F1 from 0.779 to 0.803 (3.0% increase), improved precision from 0.837 to 0.860 (2.8% increase), and held recall steady at 0.823 (0.0% change).

23 experiments later

The following graph shows the full journey taken by the autoresearch agent. It performed 23 experiments across three phases. Each discarded experiment narrowed the search space and informed the next hypothesis. 17 improvements were kept, while 6 were discarded. F1 progressed from 0.550 (zero-shot) to 0.803 (two-pass Haiku).

Screenshot of the LLM Observability Experiments timeline view comparing all 23 experiments across 3 metric fields, split visually into two sections: “Sonnet optimization” on the left and “Haiku optimization” on the right. Three colored metrics—judge_f1, judge_precision, and judge_recall—are plotted as dots over time, showing an overall upward trend from low precision scores in early Sonnet experiments to converging high scores in the final Haiku two-pass experiments. The final highlighted data point shows scores reaching approximately 0.8 or above across all three metrics.

At each step, the autoresearch reasoning and analysis output was saved to the corresponding experiment in Agent Observability Experiments as an audit log. This experiment infrastructure made it easy for the DBM team to track and analyze each step of this process, so they could find key learnings and understand what the autoresearch system had produced. Agent Observability Experiments enabled this by making every experiment a first-class object with:

A single source of truth

Every experiment records its configuration (model, prompt version, variables changed), its hypothesis (goal and expectation tags), and its results (per-case precision, recall, and F1 from the LLM judge). There is no “I think we tried that,” because the experiment list shows exactly what was tried and what happened. It’s also easy to surface experiments with common attributes (model or prompt version, tool path, etc.) and compare their evaluator scores. This makes validating experiments’ performance gains much simpler and more reliable.

Per-case trace inspection

When an experiment regresses, you need to understand why at the case level. Agent Observability Experiments captures the full agent trace for every case: which tools were called, what the model reasoned about, and what it produced. We used this to discover that Haiku was recommending new indexes when the real problem was stale statistics, which directly informed the distillation examples.

Filtering and grouping

Each experiment is tagged with phase, model family, and the variable that was changed (prompt, example, architecture). Filtering by haiku surfaces just the 11 Haiku experiments. Grouping by variable type reveals that architecture changes produced the biggest gains. These queries let you ask, “What have we tried on this model?” and get an answer in seconds.
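As a local analogy for these queries, the sketch below filters and groups a handful of experiment records by tag. The F1 values are the ones reported in this post; the record layout, the model field name, and both helper functions are assumptions, and this is not the Agent Observability query interface.

```python
# F1 scores as reported in the post; the record layout is hypothetical.
experiments = [
    {"name": "sonnet-zero-shot", "model": "sonnet", "f1": 0.550},
    {"name": "sonnet-worked-examples", "model": "sonnet", "f1": 0.776},
    {"name": "blind-haiku-single-pass", "model": "haiku", "f1": 0.779},
    {"name": "blind-haiku-twopass", "model": "haiku", "f1": 0.803},
]


def filter_by(rows, **tags):
    """Keep rows whose tags all match the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in tags.items())]


def best_per_group(rows, key, metric="f1"):
    """Return the top-scoring row for each distinct value of `key`."""
    best = {}
    for row in rows:
        group = row[key]
        if group not in best or row[metric] > best[group][metric]:
            best[group] = row
    return best


print([r["name"] for r in filter_by(experiments, model="haiku")])
print({k: v["name"] for k, v in best_per_group(experiments, "model").items()})
```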

Reproducibility

Every experiment is pinned to a fixed configuration: the same dataset, the same model, the same prompt version. If a result looks surprising, you can rerun the experiment with an identical setup and compare. The loop ran blind evals after each phase specifically because the experiment infrastructure made it cheap to do so.

The autoresearch loop produces experiments at a pace that overwhelms manual tracking. At four to eight experiments per session, the research history becomes unmanageable within a week. By supporting this process with Agent Observability Experiments, the DBM team was able to make the system practical and sustainable.

Try it yourself

This agentic experimentation methodology works for any AI agent, not just query optimization. The ingredients:

An evaluation dataset with real inputs, expected outputs, and metadata
A task function that wraps your agent
Evaluators that score output quality
The loop: hypothesize, experiment, measure, keep or discard

To learn more about running your own experiments, see our guide for building offline evaluations, and dive into the Agent Observability Experiments documentation. Agent Observability now has a free tier for your first 40,000 LLM spans. If you’re new to Datadog, sign up for a 14-day free trial.
