
Ben Cohen

Emaad Khwaja

Afshin Rostamizadeh

Ameet Talwalkar

Othmane Abou-Amal
We are excited to announce a new open-weights release of Toto, our state-of-the-art time series foundation model (TSFM), and BOOM, a new public observability benchmark that contains 350 million observations across 2,807 real-world time series. Both are open source under the Apache 2.0 license and available to use immediately.
The open-weights Toto model, trained on observability data sourced exclusively from Datadog’s own internal telemetry metrics, achieves state-of-the-art performance, outperforming all other existing TSFMs by a wide margin. It does so not only on BOOM, but also on the widely used general-purpose time series benchmarks GIFT-Eval and LSF (long-sequence forecasting).
BOOM, meanwhile, introduces a time series (TS) benchmark focused specifically on observability metrics, which have challenging characteristics not typically found in other time series. As with Toto, BOOM’s data is sourced exclusively from our internal metrics, but from a separate environment to eliminate any cross-contamination with Toto’s training data. Further details can be found in our latest technical paper.

The challenges of observability
Observability metrics represent an important, unique, and challenging subset of TS data, creating the need for observability-focused models and benchmarks. Accurately modeling observability metrics is essential for critical operations like anomaly detection (e.g., identifying spikes in error rates), predictive forecasting (e.g., anticipating resource exhaustion or scaling needs), and dynamic autoscaling. However, observability time series have challenging distributional properties, including sparsity, spikes, noise, and anomaly contamination, and they often form high-cardinality multivariate series with complex relationships among variates.
Real-world observability systems also routinely generate millions or even billions of distinct time series, making individual fine-tuning or supervised training of complex models per time series infeasible. Finally, many of these series are short-lived and correspond to ephemeral infrastructure, creating a “cold-start” problem. Together, these operational challenges call for a zero-shot TSFM with excellent performance on observability data specifically.
Open-weights Toto
The new (151 million parameter) open-weights Toto model (model card) builds upon our initial proprietary model architecture, which combined a Student-t mixture prediction head, used to model heavy-tailed observability time series, with a proportional factorized attention mechanism that efficiently captures variate interactions.
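
To make the prediction head concrete, here is a minimal sketch of a Student-t mixture output layer in PyTorch. The component count and parameterization below are illustrative assumptions, not the exact design used in Toto; see the paper for the real details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, MixtureSameFamily, StudentT

class StudentTMixtureHead(nn.Module):
    """Map hidden states to a k-component Student-t mixture per time step.

    Illustrative sketch only: Toto's actual head, component count, and
    parameter constraints are described in the paper.
    """

    def __init__(self, d_model: int, k: int = 8):
        super().__init__()
        self.k = k
        # One projection producing mixture weights, degrees of freedom, location, and scale.
        self.proj = nn.Linear(d_model, 4 * k)

    def forward(self, hidden: torch.Tensor) -> MixtureSameFamily:
        # hidden: (batch, time, d_model)
        logits, df, loc, scale = self.proj(hidden).chunk(4, dim=-1)
        weights = Categorical(logits=logits)
        components = StudentT(
            df=2.0 + F.softplus(df),          # keep df > 2 so variance is finite
            loc=loc,
            scale=F.softplus(scale) + 1e-6,   # strictly positive scale
        )
        return MixtureSameFamily(weights, components)

# Training minimizes the negative log-likelihood of the observed values:
# loss = -head(hidden).log_prob(targets).mean()
```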
Our new model also uses a novel patch-based causal normalization approach, which helps it generalize to highly nonstationary data, and a composite robust loss that stabilizes training dynamics. Detailed descriptions of all of these features, along with ablations, are presented in our paper.
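
To illustrate the idea of causal normalization, the sketch below standardizes each patch using statistics computed only from observations up to and including that patch, so no future values leak in. This is a simplified sketch under our own assumptions; the exact formulation Toto uses is given in the paper.

```python
import torch

def causal_patch_normalize(x: torch.Tensor, patch_size: int, eps: float = 1e-5):
    """Normalize each patch of a series with running (causal) statistics.

    x: (batch, time) series, with time divisible by patch_size.
    Returns the normalized series plus the per-patch means/stds needed to
    invert the transform on the model's outputs.
    """
    b, t = x.shape
    patches = x.view(b, t // patch_size, patch_size)

    # Cumulative statistics over all observations up to and including each patch,
    # so the normalization of patch i never looks at patches after i.
    counts = torch.arange(1, patches.shape[1] + 1, device=x.device) * patch_size
    cum_sum = patches.sum(dim=-1).cumsum(dim=-1)
    cum_sq = (patches ** 2).sum(dim=-1).cumsum(dim=-1)
    mean = cum_sum / counts
    var = cum_sq / counts - mean ** 2
    std = var.clamp_min(0).sqrt() + eps

    normalized = (patches - mean.unsqueeze(-1)) / std.unsqueeze(-1)
    return normalized.view(b, t), mean, std
```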

One of the biggest factors in Toto’s success is that it is trained on a large amount of real-world observability data. The bar chart below shows the number of unique (non-synthetic) time series observations in the training repositories of several TSFMs and the fraction of Toto’s training data that consists of observability metrics specifically.

As shown in the bar plots at the top of this post, the Toto model achieves state-of-the-art (SOTA) performance, obtaining the best rank score by a significant margin on the BOOM and GIFT-Eval benchmarks (as of May 19, 2025).
On the smaller LSF benchmark, we also measure the performance of a fine-tuned Toto model (TotoFT) and again find state-of-the-art results. As is standard for this benchmark, we compute normalized mean squared error (MSE) and mean absolute error (MAE) for each of six datasets within LSF. Toto achieves the best mean performance across datasets in terms of both MSE and MAE, in particular achieving a 5.8 percent improvement in MAE over the next-best method.
BOOM evaluation framework
BOOM (Benchmark of Observability Metrics) (data card) contains 350 million observations across 2,807 real-world production series, where each series can contain multiple variates. We highlight the diversity of BOOM by visualizing a two-dimensional projection of each series—along with series from GIFT-Eval and LSF. The projection is generated using the first two principal components computed from six statistical features of each series.
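
For readers who want to reproduce a similar view on their own data, the sketch below builds a small feature matrix per series and projects it with PCA. The six features chosen here are illustrative placeholders; the exact features used for the figure are documented in the paper.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

def series_features(y: np.ndarray) -> np.ndarray:
    """Six summary statistics per series (illustrative choice, not the paper's exact set)."""
    diffs = np.diff(y)
    return np.array([
        np.mean(y),
        np.std(y),
        stats.skew(y),
        stats.kurtosis(y),
        np.mean(y == 0),                      # sparsity: fraction of zero observations
        np.std(diffs) / (np.std(y) + 1e-8),   # roughness of first differences
    ])

# features: one row per series, pooled across BOOM, GIFT-Eval, and LSF
# features = np.stack([series_features(y) for y in all_series])
# projection = PCA(n_components=2).fit_transform(features)  # 2-D scatter coordinates
```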

The evaluation procedure for the benchmark closely follows the methodology established by the GIFT-Eval benchmark. In particular, we calculate MASE (mean absolute scaled error) and CRPS (continuous ranked probability score) for all zero-shot models across three forecast horizons: short, medium, and long (detailed numbers can be found in our paper). We then normalize these metrics by the performance of a seasonal naive baseline and rank models by their normalized CRPS.
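
As a concrete reference for the scoring, here is a minimal MASE implementation and the normalization step, sketched under the assumption of a single univariate series and a simple seasonal naive baseline; the benchmark's evaluation code is the authoritative version.

```python
import numpy as np

def mase(y_true: np.ndarray, y_pred: np.ndarray, y_context: np.ndarray, season: int = 1) -> float:
    """Mean absolute scaled error: forecast MAE divided by the in-context
    seasonal-naive MAE. Simplified sketch for a single univariate series."""
    naive_mae = np.mean(np.abs(y_context[season:] - y_context[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

# Ranking then uses metrics normalized by the seasonal naive baseline, e.g.:
# normalized_crps = model_crps / seasonal_naive_crps
```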
Get started with Toto and BOOM
Toto
You can use our new Toto model now. The code needed to load the model and run inference is available in our GitHub repository, as is an example Python notebook to get you up and running quickly.
Downloading the model and generating your first forecasts can be done with a few lines of code. Because Toto is a zero-shot model, it’s lightning fast—no need to fit a model before getting predictions.
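
As a rough sketch of what that looks like, the snippet below loads a checkpoint and produces a zero-shot forecast. The module paths, class names, and checkpoint identifier are assumptions for illustration; the repository’s README and example notebook show the exact interface.

```python
import torch

# NOTE: module paths, class names, and the checkpoint ID below are assumptions
# for illustration -- check the GitHub README and example notebook for the
# actual entry points.
from toto.model.toto import Toto                       # assumed import path
from toto.inference.forecaster import TotoForecaster   # assumed import path

model = Toto.from_pretrained("Datadog/Toto-Open-Base-1.0")  # assumed checkpoint name
model.eval()

forecaster = TotoForecaster(model)
context = torch.randn(1, 2, 512)  # (batch, variates, context length) of past values
forecast = forecaster.forecast(context, prediction_length=96)  # zero-shot forecast samples
```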

BOOM
The BOOM leaderboard and evaluation code are actively maintained resources for the TSFM community to leverage as it continues pushing the state of the art. We encourage submissions of other models.
If this type of work intrigues you, consider applying to work at Datadog! We are looking to grow our AI research team in NYC and Paris!
Feature image AI-generated by ChatGPT.