
Introducing ARFBench: A time series question-answering benchmark based on real incidents


Othmane Abou-Amal

Ben Cohen

Ameet Talwalkar

Stephan Xie

More than a trillion dollars are lost every year to system failures. To resolve these failures, engineers must troubleshoot outages quickly.

An important task in incident response involves analyzing observability metrics, or time series data that captures a snapshot of the health of software systems. For example, an engineer for a service may use Datadog to answer questions like “When did latency start increasing?” and “What metrics outside of latency are also behaving abnormally?” to localize the root cause of the anomalous behavior.
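As an illustration of the first question, here is a minimal sketch of flagging when a latency series starts increasing, using a simple baseline-plus-threshold rule. This is a toy heuristic for intuition, not Datadog's anomaly detection:

```python
import numpy as np

def onset_of_increase(series, baseline_len=20, k=4.0):
    """Return the first index where the series exceeds its baseline
    mean by k standard deviations, or None if it never does."""
    base = series[:baseline_len]
    mu, sigma = base.mean(), base.std()
    threshold = mu + k * max(sigma, 1e-9)  # guard against a flat baseline
    above = np.nonzero(series > threshold)[0]
    return int(above[0]) if above.size else None

# Noiseless toy trace: steady 100 ms, then a ramp starting at t=60.
# Real latency traces are noisy, which is what the k*sigma margin is for.
latency = np.full(100, 100.0)
latency[60:] += np.linspace(0.0, 50.0, 40)
print(onset_of_increase(latency))  # 61: first point clearly above baseline
```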

These time series question-answering (TSQA) tasks are essential for engineers, and they are also challenging, necessary capabilities for SRE models and agents. In this work, we explore the degree to which AI models can perform TSQA tasks.

To this end, we’re excited to introduce the Anomaly Reasoning Framework Benchmark (ARFBench), a TSQA benchmark derived from real internal incidents at Datadog, using Datadog’s own internal telemetry data (Figure 1). In this post, we’ll present three key takeaways from our benchmarking experiments:

  • Leading LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) have substantial room for improvement on ARFBench.
  • We introduce a new hybrid TSFM-VLM that yields comparable overall performance to top frontier models on ARFBench, demonstrating promising new approaches to TSQA modeling.
  • We observe markedly different error profiles between our top TSFM-VLM and human experts on ARFBench. These results suggest that their strengths are complementary. We introduce a model–expert oracle that establishes a new superhuman frontier for LLMs, VLMs, and TSFMs.
Diagram showing the ARFBench pipeline, where time series data and incident timelines are used to generate templated question-answer pairs for evaluating models.
Figure 1: Workflow of ARFBench question-answer generation. Engineers use commercial messaging platforms to respond to incidents, where they typically send time series widgets that visualize relevant metrics. Time series and incident timelines from internally monitored incidents are used as input to an LLM pipeline and fit to eight different question templates testing various aspects of anomalies. The resulting multiple-choice question-answer pairs can be used to evaluate various predictive models.

ARFBench: Using real-world incident data to create a TSQA benchmark

ARFBench is a TSQA benchmark based on real incidents internal to Datadog, using our own internal telemetry data. Compared to existing benchmarks, ARFBench differs in three key aspects:

  • It uses real time series data from production systems.
  • Each question-answer (QA) example is grounded in expert annotations and additional context.
  • Tasks are designed to test compositional reasoning: Questions are organized into three tiers of increasing difficulty, with higher-tier tasks depending on correct reasoning performed in lower tiers (Figure 2).
Examples of ARFBench questions across three difficulty tiers, illustrating increasing reasoning complexity.
Figure 2: Example questions from each tier of ARFBench. ARFBench questions are designed in three tiers of increasing difficulty, with higher-tier tasks depending on correct reasoning on lower tiers.

ARFBench consists of 750 QA pairs drawn from 142 time series and 63 incidents. Time series in ARFBench have up to 2,283 variates and 40,000 time steps, a challenging setting for context-limited models.

To create ARFBench, we built a VLM pipeline that extracts time series widgets from internal incident discussion threads to help generate and filter QA pairs. We then manually verified every generated question for correctness and privacy concerns, discarding any questions we found unsuitable.

Reasoning about time series and anomalies requires meaningful context across data modalities. ARFBench enriches time series with two types of context: time series captions, which describe what the time series represent, and multivariate groupings, which contextualize each channel relative to a larger relevant collection of time series channels. For instance, while it may not always matter that a single pod fails and restarts in a service, the combination of many pods failing and restarting simultaneously could indicate a significant anomaly. This level of complexity reflects real-world conditions that many existing unimodal, synthetic datasets fail to capture (Figure 3).
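To make the pod example concrete, here is a hedged numpy sketch (illustrative only, not how Datadog computes groupings) that counts, at each time step, how many variates breach their own threshold simultaneously:

```python
import numpy as np

def simultaneous_breaches(panel, k=3.0):
    """Count, at each time step, how many variates exceed their own
    mean + k*std. One breach may be routine noise; many at once
    suggests a group-level anomaly."""
    mu = panel.mean(axis=1, keepdims=True)
    sd = panel.std(axis=1, keepdims=True)
    return (panel > mu + k * sd).sum(axis=0)

# Toy restart counts for 10 pods over 50 time steps.
panel = np.zeros((10, 50))
for pod in range(10):
    panel[pod, 3 * pod] = 1.0  # each pod has one routine, isolated restart
panel[:, 40] = 1.0             # at t=40, every pod restarts at once
counts = simultaneous_breaches(panel)
print(counts[40])  # 10: all pods breach together, a group-level signal
```

No single variate here looks alarming on its own; the signal only appears when the channels are viewed as a group, which is exactly what the multivariate groupings provide.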

Multivariate time series showing how individual variates may appear normal alone but reveal anomalies when viewed within a group.
Figure 3: When analyzed alone, variates of a time series may not be anomalous. However, in the context of a grouping of variates, the same variate may be considered anomalous. The multivariate time series in this figure is based on the average remaining TLS certificate lifetime across different clusters and IDs of a particular service.

Frontier VLMs outperform existing baselines

We evaluated three categories of existing models on ARFBench:

  • LLMs, which take time series as text input
  • VLMs, which take time series plots as image input
  • Time series LLMs, which use a time series encoder with an LLM backbone
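For the LLM setting, serializing a series to text is itself a design decision. Below is a minimal sketch of one common scheme (rounding values and joining them with commas, with naive stride downsampling); it is not necessarily the exact formatting used in our experiments:

```python
def series_to_text(values, precision=1, max_points=256):
    """Render a numeric series as a compact prompt string. Rounding and
    downsampling trade fidelity for tokens -- a common scheme, not
    necessarily the formatting used in our experiments."""
    step = max(1, len(values) // max_points)  # naive stride downsampling
    kept = values[::step]
    return ", ".join(f"{v:.{precision}f}" for v in kept)

print(series_to_text([101.24, 99.81, 250.33]))  # "101.2, 99.8, 250.3"
```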

We compared the models to two human baselines: observability experts and time series researchers without extensive observability experience. The human experts were evaluated on a randomly sampled 25% subset of ARFBench.

Among existing models, GPT-5 (VLM) yielded the top performance at 62.7% accuracy and 51.9% F1 (Figure 4). This is much higher than the random-choice baseline of 22.5% F1, but GPT-5 still underperforms domain experts and falls far below a model-expert oracle at 87.2% accuracy and 82.8% F1 (discussed further below). As expected, model performance tends to worsen as tier difficulty increases.

Bar chart comparing accuracy and F1 scores of foundation models and human experts on ARFBench.
Figure 4: Overall accuracy and F1 of various baselines and foundation models on ARFBench. Models are sorted by decreasing accuracy. The Toto-1.0-QA-Experimental achieves the top accuracy on ARFBench and yields comparable F1 to top frontier models.

We also observe several trends in our evaluations on ARFBench. Corroborating previous work on time series classification and QA, such as Daswani et al. (2024), we find that VLMs outperform LLMs. There is also a substantial performance gap between the top proprietary and open source models. However, some open source models outperform many older proprietary models, as well as models from the Claude family.

Hybrid TSFM-VLMs show promise for specialized TSQA modeling

Architecture diagram of a hybrid model combining a time series encoder, vision encoder, and text decoder with LoRA (low-rank adaptation) layers to process time series, images, and text inputs.
Figure 5: Architecture diagram of the Toto-1.0-QA-Experimental (Toto-Qwen3-VL) model. Frozen weights are denoted with a snowflake, while trainable weights are marked with a flame. We use low-rank adaptation (LoRA), a parameter-efficient fine-tuning method that adds a small number of trainable parameters, to align TSFMs and VLMs and yield novel abilities.

Though VLMs yielded the highest accuracy and F1 score among existing models, we found that plotting and input representation were challenges for both VLMs and LLMs. For example, due to the high number of variates, we often could not plot the time series without repeating colors or occluding variates. This motivated a native time series approach alongside the VLM, in which we can use time series, plots, and text jointly as input.

To test this, we trained a hybrid model (Figure 5) by combining Toto, our state-of-the-art observability TSFM, with Qwen3-VL-32B, a leading open source VLM. We used both synthetic (Figure 6) and real multimodal data in a multi-stage post-training pipeline incorporating both supervised fine-tuning (SFT) and reinforcement learning (RL).
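The LoRA update at the heart of this setup can be written as W_eff = W + (alpha/r)·B·A, where the base weight W stays frozen and only the low-rank factors A and B train. A minimal numpy sketch, with dimensions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 128, 8, 16   # illustrative sizes, not Toto's

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(0.0, 0.02, (r, d_in))     # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init: update starts at 0

delta = (alpha / r) * B @ A              # the low-rank update LoRA learns
W_effective = W + delta

# Only r*(d_in + d_out) = 1536 parameters train, vs. 8192 in W itself.
```

Zero-initializing B means the adapted model starts out identical to the frozen base, which is why LoRA fine-tuning can be stable with few trainable parameters.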

The resulting model, Toto-1.0-QA-Experimental, yielded the top accuracy score of 63.9% and F1 comparable to top frontier models (48.9%). In the anomaly identification task category, where a model selects anomalous variates in the time series, Toto-1.0-QA-Experimental outperforms all other models by at least 8.8 percentage points in F1 and achieves the best per-category accuracy, suggesting that TSFM-VLM modeling can substantially benefit performance on particular tasks. Furthermore, Toto-1.0-QA-Experimental’s parameter count is several orders of magnitude lower than that of frontier models, offering potential efficiency gains at inference time.

Flow diagram of synthetic time series generation, including noise sampling, adding seasonality and drift, injecting anomalies, and generating captions and reasoning.
Figure 6: Synthetic data generation flow for post-training hybrid TSFM-VLM and TSFM-LLM models. Time series are generated by first sampling different lengths and scales and then by sampling each datapoint from a normal distribution. To add variation, we add seasonality and drift components into the time series, yielding different base time series (top right). For each base time series, we apply question templates and inject different anomalies (for example, level shift, change in seasonality) at various points of the time series (bottom right). Finally, we generate time series captions and reasoning for the question-answer pair using a VLM.
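The generation flow in Figure 6 can be sketched in a few lines. This is a hedged toy version for intuition (parameter values and the normal-noise model are illustrative), not our production pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

def base_series(n=200, scale=1.0, period=24, drift=0.01):
    """Gaussian noise plus seasonality and linear drift, mirroring the
    base-series steps in Figure 6 (parameter values are illustrative)."""
    t = np.arange(n)
    noise = rng.normal(0.0, scale, n)
    seasonality = scale * np.sin(2 * np.pi * t / period)
    return noise + seasonality + drift * t

def inject_level_shift(series, at, magnitude):
    """One example anomaly type: everything from index `at` onward jumps."""
    out = series.copy()
    out[at:] += magnitude
    return out

clean = base_series()
anomalous = inject_level_shift(clean, at=150, magnitude=8.0)
```

Other anomaly types from the figure (for example, a change in seasonality) would follow the same pattern: transform a slice of the base series, then record where and what was injected so a question template can reference it.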

We refer interested readers to our paper for more experimental details, error analysis, and case studies.

Models complement domain experts and set a new superhuman frontier

The current aggregate gap on ARFBench between the best models (Toto-1.0-QA-Experimental and GPT-5) and the two human domain experts is only 8.8 percentage points in accuracy and 12.7 percentage points in F1. However, at the individual question level, we observe noticeably different behavior between GPT-5 and the human experts. GPT-5 correctly answers 48% of the questions that both experts get wrong; on these questions, the human experts tend to make errors in instruction-following or fine-grained perception. Meanwhile, at least one expert correctly answers 79% of the questions that GPT-5 gets wrong; on these questions, model errors tend to involve hallucination and incorrect domain knowledge. We provide examples of both groups of errors in the paper.

Given this large difference in error distribution, we hypothesize that when experts are complemented with models, their joint capability is much higher than that of any single expert or model alone. To quantify this, we compute a model-expert oracle, a best-of-2 metric in which an oracle perfectly chooses the better answer between the model and the expert; this yields 87.2% accuracy and 82.8% F1 on our data. This is far above existing model capabilities and sets a new superhuman frontier for LLMs, VLMs, and TSFMs.
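The oracle metric itself is simple: a question counts as correct if either the model or the expert got it right. A small sketch with made-up per-question outcomes (not our data):

```python
def oracle_accuracy(model_correct, expert_correct):
    """Best-of-2 oracle: a question counts as correct if either the
    model or the expert answered it correctly."""
    assert len(model_correct) == len(expert_correct)
    hits = sum(m or e for m, e in zip(model_correct, expert_correct))
    return hits / len(model_correct)

# Toy illustration with fabricated per-question outcomes:
model_right  = [True, False, True, False]
expert_right = [False, True, True, False]
print(oracle_accuracy(model_right, expert_right))  # 0.75
```

Because the oracle only exceeds each individual baseline when their errors differ, the 87.2% figure is itself evidence that model and expert mistakes are largely disjoint.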

What’s next: Time series reasoning as a core component of agents

In the broader scope of incident response, ARFBench contains only questions targeting diagnosis and reasoning. However, we envision that strong diagnosis and reasoning abilities will play a large part in end-to-end agentic systems (for example, SRE or incident response agents) that require time series reasoning as a subroutine for understanding an incident. While ARFBench can be used to evaluate time series agents, it is not currently a multi-turn benchmark. Still, we believe that future agents and models that perform well on the single-turn ARFBench will ultimately perform better on end-to-end tasks.

Getting started with ARFBench

If you are interested in testing your model on ARFBench, you can find the benchmark, leaderboard, and model weights on Hugging Face, and the evaluation code on GitHub.

To learn more, read our technical paper.

If you’re interested in building the next generation of AI-powered observability, we’re hiring.

Related Articles

Fine-tune Toto for turbocharged forecasts

Toto and BOOM unleashed: Datadog releases a state-of-the-art open-weights time series foundation model and an observability benchmark

Introducing our open source AI-native SAST

When an AI agent came knocking: Catching malicious contributions in Datadog’s open source repos