Fine-tune Toto for turbocharged forecasts

Salahidine Lemaachi

Anna-Monica Toon

Viktoriya Zhukova

Ben Cohen

In recent work, we released Toto, Datadog’s state-of-the-art time series foundation model. Toto’s strong zero-shot performance—you can point it at a new time series and immediately get accurate forecasts without needing extensive historical data—has been well received by the broader AI community, with over 9 million downloads from Hugging Face.

As we work to deploy Toto in real use cases at Datadog, though, we’ve seen that in many real-world scenarios, we do have historical data in our target domain. In these cases, fine-tuning Toto on even a small amount of in-distribution data (on the order of tens of thousands of time series points) can help with accuracy. To get the best performance, we’ve found it pays to go beyond zero-shot.

Often, it turns out we know a lot more than just the raw time series we want to forecast; there may be predictive features for which we know the future values in advance. For example, when looking at cloud cost data, we found that Toto’s forecasts were much better when the model was aware of billing cycles/day-of-the-month effects. In other cases, we might know when marketing campaigns will launch, when new regions will come online, or how demand responds to holidays or other calendar events. These signals—referred to as exogenous covariates in the forecasting literature—can lead to dramatically better predictions.

That’s why in this post, building on our existing open weights model and open source inference code, we’re excited to announce two big upgrades to Toto:

  • Official fine-tuning support so you can adapt Toto to your own workloads.
  • Support for exogenous covariates so you can steer Toto’s forecasts using known future signals.

By adding these new capabilities, we’re happy to be able to share some of the work we’ve been doing to bring Toto into our products with the open source community. You can check out these new features today in the Toto repository on GitHub.

Toto now supports fine-tuning

Zero-shot forecasting is fantastic when you’re just getting started: you can get zero-shot forecasts from Toto with a few lines of code and sub-second latency. But for critical tasks where you need the best possible forecasting accuracy, you’ll often want to adapt the model to your particular domain. Fine-tuning lets Toto specialize on those patterns while still benefiting from its broad, pre-trained understanding of observability metrics and other time series data. Since releasing Toto, fine-tuning has been our most-requested feature.

With this release, we’re adding:

  • Reference fine-tuning code in the Toto repository, using the same architecture we ship for inference.
  • A simple fine-tuning recipe you can apply to your own datasets and run easily on a single GPU.
  • A tutorial notebook showing an end-to-end workflow for an example dataset.

In practice, we’ve found that fine-tuning can lead to meaningful improvements over zero-shot performance when you have even a modest amount of domain-specific training data.
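
At a high level, the fine-tuning recipe has the usual PyTorch training-loop shape. The sketch below is purely illustrative: the loss-returning model interface and the hyperparameters are assumptions for exposition, not the repository's actual API.

```python
# Minimal sketch of a single-GPU fine-tuning loop for a pretrained
# forecaster. The loss-returning forward() interface is an assumption
# for illustration; see the Toto repository for the official recipe.
import torch

def fine_tune(model, train_loader, epochs=3, lr=1e-5, device="cpu"):
    """Adapt a pretrained model to in-domain series with a low learning rate."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for context, target in train_loader:
            optimizer.zero_grad()
            # Assumed interface: the model returns its training loss
            loss = model(context.to(device), target.to(device))
            loss.backward()
            optimizer.step()
    return model
```

A low learning rate and few epochs help the model specialize without forgetting its pretrained behavior.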

Exogenous covariates: Steer your forecasts with known future predictors

In many forecasting problems, you know more than just the past behavior of the time series you’re trying to predict. You may also track other signals that influence it, and, crucially, you may know their future values at prediction time.

Common examples include:

  • Holidays and events: You know the dates of Black Friday, Cyber Monday, product launches, and marketing campaigns.
  • Planned capacity or spend: You know when new regions will launch or when infrastructure upgrades will land.
  • Business calendars: You know quarter boundaries, billing cycles, or internal maintenance windows.
  • Weather forecasts: You have access to predicted temperature, humidity, or wind for upcoming hours or days.

In the time series literature, these kinds of variables are called exogenous covariates. When you have them—and especially when you know their future values—they can be essential to getting the best possible forecast. That’s why we’re especially excited that this release goes beyond just fine-tuning the existing Toto model and adds a new capability: fine-tuning with exogenous covariates.

With the new API:

  • In your training data, you can mark specific features as exogenous covariates.
  • At inference time, you can pass in the known future values of these covariates when forecasting so Toto can “see” that, for example, a heat wave or a holiday is coming and adjust its forecast accordingly.

This gives you a way to steer the model: rather than hoping it rediscovers the impact of major events from the target series alone, you explicitly tell it what’s coming.
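
To make the shape of this workflow concrete, here is an illustrative sketch. Every name in it (`forecast_with_covariates`, the model call signature) is hypothetical; the real API lives in the Toto repository.

```python
# Illustrative shape of a covariate-aware forecast call. All names
# here are hypothetical; consult the Toto repository for the real API.
import numpy as np

def forecast_with_covariates(model, past_target, past_exog, future_exog):
    """Condition the forecast on known future covariate values."""
    horizon = len(future_exog)
    # Past history carries both the target and its covariates...
    context = np.column_stack([past_target, past_exog])
    # ...while the known future covariate values steer the prediction.
    return model(context, future_exog, horizon)
```

The key constraint is that the future covariate values must cover the full forecast horizon, since the model conditions on them at every predicted step.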

Example 1: Cloud cost forecasting

To illustrate how to take advantage of fine-tuning with exogenous covariates, let’s look at a common forecasting problem: cloud infrastructure costs. Getting accurate projections of cloud spend is an important challenge, both for Datadog and for our customers.

We use an example internal dataset of historical cloud spend. The task is to predict the daily cost, and we have access to a simple exogenous covariate: a binary flag indicating whether the day is the first of the month. Since billing cycles often align with the start of the month, this irregular exogenous covariate (EV) can be a significant driver of temporary spikes in spend, and including it can improve predictions.
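
Constructing this flag takes one line with pandas. The `add_billing_flag` helper and column names below are our own illustrative choices, not part of the Toto API:

```python
# Sketch: derive the "first day of the month" exogenous flag directly
# from a daily cost series (helper and column names are illustrative).
import pandas as pd

def add_billing_flag(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    out = df.copy()
    out["is_month_start"] = pd.to_datetime(out[date_col]).dt.is_month_start.astype(int)
    return out

costs = pd.DataFrame({
    "date": pd.date_range("2024-01-30", periods=4, freq="D"),
    "cost": [10.0, 11.0, 42.0, 12.0],
})
print(add_billing_flag(costs)["is_month_start"].tolist())  # → [0, 0, 1, 0]
```

Because the calendar is known arbitrarily far in advance, this covariate's future values are always available at prediction time.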

We compare four models:

  • Baseline model: We use the previously deployed forecasting baseline—a tuned statistical model based on SARIMAX, with AutoARIMA for hyperparameter selection.
  • Toto zero-shot: We apply Toto directly to the cloud cost series without any fine-tuning or exogenous covariates.
  • Toto fine-tuned on the target: We fine-tune Toto on historical cost data, but still only use the target series.
  • Toto fine-tuned with exogenous covariates: We fine-tune Toto on a multivariate dataset that includes both historical cost and the “first day of the month” flag. At inference time, we feed Toto the future “first day of the month” flag along with the past cost history.

Visually, we can see that the zero-shot Toto forecast is unable to model the irregularly spaced monthly seasonality; it completely fails to predict the spikes. The fine-tuned version does a better job, but still misses some of the spikes. And the version fine-tuned with the exogenous covariate (EV) comes closest to the observed values, successfully capturing the spikes at the beginning of the month, while also modeling weekly patterns that are not captured by the baseline model:

Figure 1: Comparing forecasts from the baseline model and Toto across its zero-shot, fine-tuned, and fine-tuned-with-exogenous-covariates variants. The fine-tuned model with exogenous covariates (bottom panel) best captures both the irregular monthly spikes and the intermediate oscillations, while maintaining well-calibrated uncertainty intervals.

Quantitatively, we assess the forecast using a set of evaluation metrics selected by internal stakeholders: mean absolute error (MAE) and mean squared error (MSE) on the daily forecasts, and mean absolute percentage error (MAPE) on monthly roll-ups (for cloud cost data, accurately forecasting spend over the entire month is a core business objective). With no fine-tuning, zero-shot Toto fails to consistently outperform the baseline. The version of Toto fine-tuned on the target variable does manage to perform slightly better than the baseline; meanwhile, the Toto model fine-tuned with exogenous covariates (EV) shows significant improvement across all metrics.

Metric ↓              Baseline (SARIMAX)    Zero-shot Toto    Fine-tuned Toto without EV    Fine-tuned Toto with EV
MAE ↓                 0.347                 0.333             0.304                         0.291
MSE ↓                 0.622                 0.658             0.589                         0.490
MAPE (monthly) ↓      0.104                 0.148             0.094                         0.082

Table 1 shows a performance comparison of Toto variants on an internal cloud cost dataset. Metrics are aggregated from predictions over sliding windows over the test set; lower is better for all metrics, and the fine-tuned model with exogenous covariates achieves the best result on every metric.
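
For reference, the monthly roll-up MAPE sums daily actuals and forecasts per calendar month before comparing them. A sketch, with illustrative column names:

```python
# Sketch: MAPE on monthly roll-ups. Daily actuals and forecasts are
# summed per calendar month first, then compared (names illustrative).
import pandas as pd

def monthly_mape(df: pd.DataFrame) -> float:
    """df: daily DatetimeIndex with 'actual' and 'forecast' columns."""
    monthly = df.resample("MS").sum()  # bin by month start
    ape = (monthly["actual"] - monthly["forecast"]).abs() / monthly["actual"]
    return float(ape.mean())
```

Evaluating at the monthly level rewards forecasts whose daily errors cancel out over the billing period, which is what matters for budgeting.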

Example 2: Public benchmark datasets

To further validate the performance of fine-tuning Toto with exogenous covariates, we tested our method on a selection of publicly available datasets. We selected a subset of 13 datasets from the fev datasets collection—datasets with exogenous covariates, and for which there was no overlap with Toto’s pretraining data.

These benchmarks include data with exogenous covariates from various domains, including:

  • Energy demand forecasting, using external weather forecasting information as a predictor of expected load
  • Air quality forecasting, modeling pollutant concentrations conditioned on temperature and humidity measurements
  • Electricity price forecasting across several European markets, incorporating exogenous inputs such as generation and load forecasts
  • Retail demand forecasting, covering transactional orders and sales conditioned on calendar-based exogenous covariates (for example, promotions and holidays)

We see substantial improvement in average performance on standard forecasting metrics such as weighted quantile loss (WQL) and mean absolute scaled error (MASE) when going from zero-shot to fine-tuning, and an additional boost when including exogenous covariates. This is not universally the case for every individual dataset. When using Toto on your own data, it’s important to experiment on a held-out test set with different settings (zero-shot, fine-tuning, and fine-tuning with exogenous covariates) to identify the best recipe.

Metric ↓              Zero-shot Toto    Fine-tuned Toto without EV    Fine-tuned Toto with EV
WQL (Avg.) ↓          0.111             0.100                         0.096
WQL (Avg. Rank) ↓     2.284             1.753                         1.499
MASE (Avg.) ↓         0.632             0.574                         0.535
MASE (Avg. Rank) ↓    2.433             1.931                         1.277

Table 2 shows a performance comparison of Toto variants on a selection of 13 public datasets. For each dataset, models are evaluated using sliding windows over the test split; scores are first aggregated at the dataset level and then aggregated across all datasets using a geometric mean. Lower is better for all metrics, and the fine-tuned model with exogenous covariates achieves the best result on every metric.
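
As a reference point, a common formulation of MASE scales the forecast MAE by the in-sample naive-forecast MAE. The sketch below is illustrative and may differ in conventions (seasonality, scaling window) from the benchmark harness:

```python
# Sketch of mean absolute scaled error (MASE): forecast MAE scaled by
# the in-sample seasonal-naive MAE. Conventions may differ from the
# benchmark harness used above.
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    y_train = np.asarray(y_train, dtype=float)
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale

print(mase([5, 6], [5, 7], [1, 2, 3, 4]))  # → 0.5
```

Because it is scale-free, MASE can be averaged across datasets with very different magnitudes, which is why it suits multi-dataset benchmarks like this one.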

Full code for reproducing this example—including data loading, feature construction for exogenous covariates, and training scripts—is available in the Toto repository.

Deep dive: How Toto uses exogenous covariates under the hood

Let’s unpack how exogenous covariates fit into Toto’s architecture.

Toto as an autoregressive patch model

Toto is an autoregressive model: during training, it’s optimized to predict the next patch of the time series based on the previous patches. Concretely, we break the time series into non-overlapping patches (windows of contiguous datapoints), and for each position, the model receives a context of past patches and is trained to predict the next one. During pretraining, we shift the targets one patch forward relative to the inputs; this is analogous to how LLMs are trained to predict the next token in a sequence.
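
The next-patch objective can be sketched in a few lines (a toy illustration, not the actual data pipeline):

```python
# Toy sketch of next-patch prediction targets: split a series into
# non-overlapping patches and shift targets one patch forward.
import numpy as np

def next_patch_pairs(series: np.ndarray, patch_len: int):
    usable = (len(series) // patch_len) * patch_len
    patches = series[:usable].reshape(-1, patch_len)
    inputs, targets = patches[:-1], patches[1:]  # patch i predicts patch i+1
    return inputs, targets

series = np.arange(12, dtype=float)
inputs, targets = next_patch_pairs(series, patch_len=4)
print(targets[0])  # → [4. 5. 6. 7.]
```

Just as an LLM sees tokens up to position t and predicts token t+1, the model here sees patches up to index i and predicts patch i+1.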

Injecting exogenous covariates

To incorporate exogenous covariates, we want the model to know the future values of those variables at each time step it’s predicting without training it to forecast those covariates themselves. Because Toto is natively multivariate (it already has the ability to model interactions between multiple variates), with slight modifications we can add known future exogenous covariates to our forecasts.

We do this in three steps:

  1. Shift exogenous inputs forward: For the exogenous covariates, we shift the inputs forward by one patch (see Figure 2), aligning them in time with the targets, and stack them along the variate dimension. Intuitively, when Toto is asked to predict the patch at time t+1, it is given not only the history up to t, but also the values of the exogenous covariates at t+1. This ensures that, during training, the model always “knows” the exogenous covariates for every time step it is trying to predict.
  2. Prepend a variate label token to the sequence in embedding space: Variate-wise attention in Toto is permutation invariant and does not use positional encodings. During fine-tuning, we want the model to be able to learn which variates are exogenous and how they relate to the targets. To enable this, we prepend a learnable label token to each variate sequence in embedding space, using one token for target variates and another for exogenous covariates.
  3. Mask the loss on exogenous covariates: Since future values of the exogenous covariates are assumed to be known, we do not care about predicting them. Instead, we mask the loss for the exogenous dimensions so that Toto is only penalized for errors on the target variable(s). This allows the model to freely use the exogenous inputs as conditioning information without being forced to produce forecasts for them.
Figure 2: Exogenous covariates are added to the input by shifting them left by one patch position. This has the effect of moving them forward in time relative to the inputs for the target variable. Additionally, we add the corresponding variate label token to the beginning of each sequence.

Because exogenous covariates are handled at the data and loss level, we can keep the core Toto architecture unchanged. This makes it easier to reuse existing infrastructure—model weights, inference code, and scaling strategies—while still gaining the benefits of richer side information.
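
Steps 1 and 3 can be sketched at the array level (a simplified, illustrative version; the label-token step of step 2 operates in embedding space and is omitted here):

```python
# Simplified sketch of steps 1 and 3: shift exogenous patches forward
# so they align with the target patch being predicted, and mask the
# loss so only target variates are penalized.
import numpy as np

def align_exog_and_mask(target_patches, exog_patches):
    """Both arguments: (num_patches, patch_len) arrays."""
    inputs = target_patches[:-1]     # history up to patch t
    targets = target_patches[1:]     # patch t+1, the prediction target
    exog_future = exog_patches[1:]   # exog values at t+1, known in advance
    # Stack along the variate dimension: (num_steps, num_variates, patch_len)
    stacked = np.stack([inputs, exog_future], axis=1)
    loss_mask = np.array([1.0, 0.0])  # penalize the target variate only
    return stacked, targets, loss_mask
```

At each step, the model's input pairs the target's history with the covariate values for the step being predicted, and the mask ensures the covariate channel contributes conditioning information but no loss.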

Fine-tune Toto on your own data

If you’re working with time series that depend on known-future signals—like demand driven by holidays, cloud capacity shaped by rollouts, or energy usage influenced by weather—fine-tuning Toto with exogenous covariates is a natural next step. We’re excited to see what you’ll build with Toto’s new fine-tuning and exogenous covariates capabilities. Try fine-tuning with exogenous covariates today using the reference code and examples in the Toto repository, including the energy forecasting demo from this post.

If building models like Toto—and pushing the frontier of AI for observability—sounds exciting, check out open roles at Datadog AI Research.

Related Articles

Toto and BOOM unleashed: Datadog releases a state-of-the-art open-weights time series foundation model and an observability benchmark

Detecting hallucinations with LLM-as-a-judge: Prompt engineering and beyond

Detect human names in logs with ML in Sensitive Data Scanner

Datadog at NeurIPS 2025