In the first installment of this series, we introduced Chronos-2, a foundation model designed for time-series forecasting. We rolled up our sleeves and explored a hands-on case study to see what Chronos-2 can achieve right out of the box, without any additional training.
However, as we mentioned at the conclusion of Part 1, zero-shot predictions don’t always cut it.
There are several situations where this approach falls short:
Your dataset might be completely different from what the model was trained on.
The model might consistently produce the same types of errors.
You have a wealth of historical data that could be put to better use.
Your specific forecasting goal might not align with what Chronos-2 was originally optimized for.
That’s where fine-tuning comes into play.
In this article, we’ll pick up where we left off with the same building electricity-demand example from Part 1. We’ll guide you through five different ways to fine-tune Chronos-2:
Adapting to a single building: fine-tuning the model for one specific asset.
Fine-tuning across a portfolio: combining historical data from multiple buildings to create a shared adapter.
Fine-tuning with covariate information: incorporating known future signals into the fine-tuning process.
Combining portfolio and covariate data: making the most of both fleet-wide and covariate information.
Transferring to unseen assets: adapting the model once, then applying it to buildings it never encountered during fine-tuning.
By the time you finish reading, you’ll have a practical blueprint for fine-tuning a time-series foundation model that you can apply to your own datasets.
Part 1 of this series covers how to use Chronos-2 for forecasting across univariate, multivariate, covariate-informed, and cross-learning scenarios. If you’re interested in using Chronos-2 without any customization, take a look at that post here.
1. Revisiting the case study
Let’s quickly recap the setup from Part 1.
We’re working with a synthetic dataset that tracks hourly electricity demand across eight commercial buildings. Our goal is to predict the total electricity consumption one week in advance—that’s 168 hours. The dataset is generated using a physical simulator, which breaks down the total load into four components: base load, plug load, lighting load, and HVAC load. In practical terms, plug and lighting loads are driven by weekday occupancy patterns, while HVAC load depends on outdoor temperature.
What’s new in Part 2 is that we’ve extended the simulation period to generate enough data for fine-tuning. We’ve also maintained a clear boundary between the data used for fine-tuning and the data used for evaluation. Specifically, we’ve divided the timeline into four consecutive segments:
Training (about 12 weeks): March 1, 2025 to May 22, 2025. This is the only data used to update the adapter weights.
Validation (1 week): May 23, 2025 to May 29, 2025. Used for selecting checkpoints and implementing early stopping.
Inference context (45 days): May 30, 2025 to July 13, 2025. This window provides the context for making forecasts. The zero-shot approach in Part 1 also used 45 days of context.
Test (1 week): July 14, 2025 to July 20, 2025. The forecast horizon for evaluating the fine-tuned model.
It’s important to note that the fine-tuning process only has access to data from the training and validation sets, so there’s no risk of data leakage in our analysis.
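As a quick sanity check on the split, the stated date ranges can be turned into hour counts. The sketch below treats every range as inclusive, which is our assumption; under that reading the training segment covers 83 days, just shy of 12 full weeks.

```python
from datetime import date

# Inclusive start/end dates as stated in the split description above.
segments = {
    "train": (date(2025, 3, 1), date(2025, 5, 22)),
    "validation": (date(2025, 5, 23), date(2025, 5, 29)),
    "context": (date(2025, 5, 30), date(2025, 7, 13)),
    "test": (date(2025, 7, 14), date(2025, 7, 20)),
}


def n_days(start, end):
    """Number of calendar days in an inclusive date range."""
    return (end - start).days + 1


# Hourly data, so each day contributes 24 observations.
hours = {name: n_days(s, e) * 24 for name, (s, e) in segments.items()}

# Each segment must start the day after the previous one ends.
names = list(segments)
contiguous = all(
    (segments[b][0] - segments[a][1]).days == 1
    for a, b in zip(names, names[1:])
)

for name in names:
    print(f"{name:<10} {hours[name] // 24:>3} days  {hours[name]:>5} hours")
print("contiguous, non-overlapping:", contiguous)
```

The validation and test segments each come out at exactly 168 hours, matching the one-week forecast horizon.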
Figure 1. Train/val/context/test split. (Image by author)
2. A quick overview of fine-tuning and LoRA
Before we dive into the practical steps, let’s take a moment to understand what fine-tuning is and explore one of its key techniques: LoRA.
2.1 What exactly is fine-tuning?
Fine-tuning involves taking a model that’s already been pretrained and continuing to train it on your specific dataset. Essentially, we’re adjusting the pretrained model’s weights so it can better recognize and follow the patterns unique to our problem.
In the case of Chronos-2, it’s a 120-million-parameter Transformer that has already absorbed a broad understanding of time-series structures. Fine-tuning allows us to gently steer its behavior to better match our particular data.
But do we really need to update all 120 million parameters?
Probably not.
Doing so would be computationally expensive and require significant storage. Plus, in most real-world scenarios, we simply don’t have enough data to justify adjusting every single parameter.
We need a smarter, more efficient approach. Enter LoRA.
2.2 What is LoRA?
LoRA, which stands for Low-Rank Adaptation [1], is built on a straightforward concept: rather than modifying the full weight matrices, we keep the original pretrained model frozen and only train a small set of additional parameters that make minor adjustments to its behavior.
To illustrate, imagine one layer in the pretrained model has a weight matrix W with dimensions d_out × d_in, where d_out = d_in = 1024.
A full fine-tuning update to this weight matrix would look like this:
W′ = W + ΔW
In this case, ΔW would also need to be 1024 × 1024. A full update would mean adjusting over a million trainable parameters.
The clever trick behind LoRA is that ΔW isn’t learned as a complete matrix. Instead, LoRA breaks it down into the product of two much smaller matrices:
ΔW = BA
Here, A has dimensions r × d_in and B has dimensions d_out × r. The value r is known as the rank of the adapter. The term “low-rank” comes from the fact that r is typically quite small—commonly 4, 8, 16, or 32.
What this means in practice is that LoRA prevents the fine-tuning process from making unrestricted, full-dimensional changes to W. Instead, updates are confined to a lower-dimensional subspace. And it’s precisely this constraint that makes the method so efficient.
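To make the low-rank constraint concrete, here is a minimal NumPy sketch. The matrix sizes, random values, and variable names are illustrative assumptions, not anything taken from Chronos-2. It shows that the product of B and A has the same shape as W but rank at most r, and that the adapted layer can be evaluated without ever materializing ΔW.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4  # small illustrative sizes so this runs instantly

W = rng.normal(size=(d_out, d_in))  # stands in for a frozen pretrained weight
A = rng.normal(size=(r, d_in))      # trainable factor, r x d_in
B = rng.normal(size=(d_out, r))     # trainable factor, d_out x r

delta_W = B @ A                     # same shape as W, but rank <= r
W_adapted = W + delta_W

x = rng.normal(size=(d_in,))
# Forward pass without forming delta_W explicitly: W x + B (A x)
y_fast = W @ x + B @ (A @ x)

print("shape of delta_W:", delta_W.shape)
print("rank of delta_W:", np.linalg.matrix_rank(delta_W))
print("outputs match:", np.allclose(W_adapted @ x, y_fast))
```

In the original LoRA paper, B is initialized to zero so that the adapter starts as a no-op; random values are used here only so that the rank is visible.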
This approach works well in real-world applications because most downstream tasks don’t require adjusting the model in every possible direction. Typically, the necessary adjustments exist within a much more limited subspace. LoRA takes direct advantage of this principle.
In practice, this method offers multiple benefits. With significantly fewer parameters to train, we can dramatically reduce GPU memory consumption, which is primarily used by gradients and optimizer states. Checkpoints also become much smaller since we don’t need to store a complete copy of the 120M-parameter model for each experiment—only the adapter weights. Additionally, this approach helps minimize the risk of overfitting, particularly when working with smaller downstream datasets.
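To put rough numbers on the savings, we can count trainable parameters for the 1024 × 1024 example layer at a few typical ranks. This is plain arithmetic on the shapes defined earlier; the helper names are ours.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in update is learned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in


d_out = d_in = 1024
full = full_update_params(d_out, d_in)
for r in (4, 8, 16, 32):
    lora = lora_params(d_out, d_in, r)
    print(f"r={r:>2}: {lora:>6,} trainable vs {full:,} ({lora / full:.2%})")
```

Even at r = 32, the adapter for this layer trains only a few percent of the parameters that a full update would touch.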
3. How to apply LoRA to Chronos-2
When implementing LoRA for the Chronos-2 model, the first decision involves identifying which layers to adapt.
To make this decision, we need to understand the model’s architecture.
As discussed in Part 1, Chronos-2 is a Transformer encoder structured around three main components:
An input patch embedding layer.
Multiple attention layers that alternate between time attention and group attention mechanisms.
An output patch embedding layer.
Our LoRA setup focuses on adapting two of these three components:
The Q, K, V, and O projections across all attention layers. These projections allow us to adjust how the model processes temporal patterns within individual time series and relationships across different series within a group.
Within Chronos-2, each attention layer contains four linear projections that transform the layer’s input to its output. The query (Q), key (K), and value (V) projections create three distinct representations of the input. The attention mechanism then calculates similarity scores between each query and key pair, using these scores to produce a weighted combination of the values. Finally, the output projection (O) merges information from all attention heads and adjusts the dimensions to align with the layer’s expected output size.
The output patch embedding. This enables us to adjust how the model converts its internal representations into final predictions.
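A configuration along these lines might look like the sketch below. The keys mirror argument names of peft's LoraConfig (r, lora_alpha, lora_dropout, target_modules), but the values and, in particular, the module names are placeholders we made up for illustration. The settings actually used in the experiments are not reproduced here, and the real module names should be read from the loaded model.

```python
# Hypothetical LoRA settings; the values below are assumptions for illustration.
rank = 8
lora_alpha = 16

lora_settings = {
    "r": rank,
    "lora_alpha": lora_alpha,
    "lora_dropout": 0.0,
    # ASSUMPTION: placeholder names for the Q, K, V, O projections and the
    # output patch embedding. Inspect model.named_modules() for the real ones.
    "target_modules": ["q", "k", "v", "o", "output_patch_embedding"],
}

# In standard LoRA, the low-rank update is multiplied by lora_alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(f"update scaling = {scaling}")
```

With these example values the update is scaled by a factor of 2, which is why lora_alpha is usually tuned together with the rank.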
In this configuration, lora_alpha serves as a scaling parameter that determines the intensity of the LoRA adjustments: the low-rank update is multiplied by lora_alpha / r, so higher values lead to more substantial adaptations.
For our experiments, we utilize the Hugging Face peft library to fine-tune Chronos-2.
Now we’re prepared to dive into the practical implementation.
4. Five fine-tuning scenarios
In the experiments that follow, we begin with the identical base model—the amazon/chronos-2 checkpoint—and apply the same LoRA configuration throughout. The variable we change is the training data used for fine-tuning.
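Our primary evaluation metric is weighted absolute percentage error (WAPE), which in its standard form is:
WAPE = Σ_t |y_t − ŷ_t| / Σ_t |y_t|
Here, y_t is the actual load and ŷ_t is the forecast at hour t, with both sums taken over the forecast horizon.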
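For reference, a minimal pure-Python version of WAPE is sketched below, using the standard definition (total absolute error divided by total absolute actuals). The function name and the toy numbers are ours, not from the notebook.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|y - yhat| / sum|y|."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total_abs_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    total_abs_actual = sum(abs(a) for a in actual)
    if total_abs_actual == 0:
        raise ValueError("WAPE is undefined when all actual values are zero")
    return total_abs_error / total_abs_actual


# Toy example: four hours of demand in kW.
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 330.0, 380.0]
print(f"WAPE = {wape(actual, forecast):.1%}")
```

Because errors are normalized by total demand rather than hour by hour, low-load hours do not blow up the metric the way they can with MAPE.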
With this framework established, let’s examine each of the five scenarios in detail.
If you haven’t configured the Chronos environment yet, please consult Part 1: 4.1 Setting up the Chronos-2 model.
4.1 Single-building adaptation
Is it possible to fine-tune using data from just one asset?
Let’s say we’re focused on a single building, such as Building 03. We have its historical energy consumption data and want to customize Chronos-2 to capture this building’s specific usage patterns.
This represents the most straightforward fine-tuning scenario. There are no additional covariates or portfolio data to consider—just one target time series.
As previously described, we initialize from the amazon/chronos-2 checkpoint, keep the base model weights fixed, and train only a compact LoRA adapter.
Chronos-2’s fine-tuning interface requires training data formatted as a list of task dictionaries. For our univariate task with a single target, each dictionary requires just one key: target.
Here’s how to prepare the fine-tuning input for Building 03:
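The sketch below illustrates the expected structure. A synthetic NumPy series stands in for Building 03's total_load_kw column, so the values and the slicing by hour counts are assumptions; the point is the shape of the task dictionaries, which follows the (1, n_timesteps) target layout used in the later portfolio examples.

```python
import numpy as np

HOURS_PER_DAY = 24
train_hours = 83 * HOURS_PER_DAY  # March 1 to May 22, inclusive
val_hours = 7 * HOURS_PER_DAY     # May 23 to May 29

# Synthetic stand-in for Building 03's hourly total_load_kw, March 1 to May 29.
rng = np.random.default_rng(0)
hours = np.arange(train_hours + val_hours)
load_kw = (
    100.0
    + 20.0 * np.sin(2 * np.pi * hours / HOURS_PER_DAY)
    + rng.normal(scale=2.0, size=hours.size)
).astype("float32")

# One task per series; "target" has shape (n_targets, n_timesteps).
train_inputs = [{"target": load_kw[:train_hours][None, :]}]

# The validation task keeps the training history as context; its final
# prediction_length hours (the validation week) act as the forecast target.
validation_inputs = [{"target": load_kw[None, :]}]

print(train_inputs[0]["target"].shape)
print(validation_inputs[0]["target"].shape)
```

With real data, the same dictionaries would be built from the training and validation dataframes filtered to Building 03 and sorted by timestamp.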
First, a quick reminder: this validation data doesn’t update the LoRA adapter weights; it helps us select the best adapter checkpoint. This follows the standard practice used when training any neural network.
Second, you may have noticed that validation_df includes not just May 23-29 but all earlier data as well. This is necessary because Chronos-2 requires historical context to generate forecasts. Based on the configured prediction_length, Chronos automatically uses the final prediction_length hours of validation_df as the actual validation forecast target, with the preceding values serving as context.
In the current setup, we defined just a single validation task within validation_inputs. As a result, there’s effectively only one validation forecast window. This is because Chronos-2 always takes the final prediction_length time steps from the dataframe as the target window, and the context_length steps before that as context—regardless of how much additional data you include in the dataframe. So, simply providing a longer validation dataframe won’t generate extra validation windows on its own.
If you actually want multiple validation forecast windows—for instance, to perform rolling-window validation—you’ll need to define several separate validation tasks, each ending at a different cutoff point. That way, Chronos-2 will evaluate the last 168 hours of each individual task.
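One way to build such tasks is sketched below. The helper name, the stride, and the synthetic series are our own assumptions; the only behavior relied on is the one described above, namely that the last prediction_length steps of each task form its validation window.

```python
import numpy as np


def rolling_validation_tasks(series, prediction_length, n_windows, stride):
    """Build one task per cutoff; consecutive cutoffs are `stride` steps apart."""
    tasks = []
    for k in range(n_windows):
        end = len(series) - k * stride
        if end <= prediction_length:
            raise ValueError("not enough history for this many windows")
        tasks.append({"target": series[:end][None, :]})
    return tasks[::-1]  # oldest cutoff first


# Stand-in for 90 days of hourly load; values equal their own hour index.
series = np.arange(90 * 24, dtype="float32")
tasks = rolling_validation_tasks(
    series, prediction_length=168, n_windows=3, stride=168
)
for task in tasks:
    window = task["target"][0, -168:]
    print(
        f"task length {task['target'].shape[1]:>4}, "
        f"validation window covers hours {int(window[0])}..{int(window[-1])}"
    )
```

Note that earlier cutoffs reach back into what would otherwise be training data, so the training range should be shortened accordingly to keep checkpoint selection free of leakage.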
For training, however, no special handling is required. You can just feed Chronos-2 a long historical time series, and it will internally sample numerous training windows from it.
Here, prediction_length is set to 168, aligning the training objective with our actual test-time goal: forecasting one week ahead at hourly resolution. We also set context_length to 45 * 24, giving the model a 45-day lookback window—the same context length used in Part 1. Additionally, because we’ve provided validation_inputs, automatic checkpoint selection is enabled. Every 25 training steps, Chronos-2 computes the validation loss; if it fails to improve for 6 consecutive checks (early_stopping_patience=6), training halts early.
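The checkpoint-selection rule can be illustrated with a small, library-independent simulation. This is not Chronos-2's internal code: the function name and the loss values are made up, and the logic is simply the usual patience-based early stopping with a check every 25 steps and a patience of 6.

```python
def early_stopping_trace(val_losses, eval_every=25, patience=6):
    """Return (best_step, stop_step) for a sequence of validation losses.

    val_losses[i] is the validation loss measured at step (i + 1) * eval_every.
    """
    best_loss = float("inf")
    best_step = None
    checks_without_improvement = 0
    step = 0
    for i, loss in enumerate(val_losses):
        step = (i + 1) * eval_every
        if loss < best_loss:
            best_loss, best_step = loss, step
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # patience exhausted: stop training here
    return best_step, step


# Made-up losses: the first check is the best, then validation loss creeps up.
val_losses = [0.50, 0.52, 0.53, 0.55, 0.56, 0.58, 0.60, 0.61, 0.63]
best_step, stop_step = early_stopping_trace(val_losses)
print(f"best checkpoint at step {best_step}, training stops at step {stop_step}")
```

In this toy run the best checkpoint is the very first one, and training halts six checks later, which is the same qualitative pattern as a validation curve that rises after its first evaluation.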
Figure 2. Training loss continues to decrease, but validation loss starts increasing after the first checkpoint. (Image by author)
I ran this fine-tuning job on an NVIDIA RTX 2000 Ada Laptop GPU with 8 GB of VRAM. The entire run completed in roughly 42 seconds.
Once the adapter is trained, making predictions works almost identically to zero-shot forecasting:
For Building 03, the target-only zero-shot baseline achieves a WAPE of 8.3%. After fine-tuning exclusively on Building 03’s data, the WAPE drops to 7.6%, a relative reduction of roughly 8%. The gain is modest, but it shows that even single-series fine-tuning yields a measurable improvement.
4.2 Portfolio fine-tuning
Can we combine historical data from the entire portfolio to train a single shared adapter?
In real-world scenarios, we often manage a collection of related assets.
In our case, that means eight buildings. While they aren’t identical, they exhibit similar daily and weekly demand patterns.
This raises a natural question: can we fine-tune one adapter using data from the entire building portfolio, rather than training a separate adapter for each building?
Here, we’re still forecasting only total_load_kw, so the setup is nearly identical to before:
target_column = "total_load_kw"
# One training task per building; each target has shape (1, n_timesteps).
train_inputs = [
    {
        "target": building_df[[target_column]].to_numpy(dtype="float32").T,
    }
    for _, building_df in train_df.groupby("building", sort=True)
]
# One validation task per building, in the same format.
validation_inputs = [
    {
        "target": building_df[[target_column]].to_numpy(dtype="float32").T,
    }
    for _, building_df in validation_df.groupby("building", sort=True)
]
Effectively, each building becomes a single training task. We then fine-tune Chronos-2 using the same LoRA configuration as before:
It’s important to note that we’re not training eight independent adapters here. Instead, we’re asking Chronos-2 to learn a single shared adaptation that generalizes across the whole portfolio. If there are recurring patterns across buildings, the adapter has more opportunities to capture them. On the other hand, if each building behaves entirely independently, this approach may offer limited benefit.
The fine-tuning results are shown below, comparing forecast accuracy between zero-shot and fine-tuned Chronos-2:
Building | Zero-shot WAPE | Fine-tuned WAPE
Building 01 | 8.0% | 7.4%
Building 02 | 12.2% | 11.3%
Building 03 | 8.3% | 7.5%
Building 04 | 8.0% | 7.6%
Building 05 | 7.2% | 6.8%
Building 06 | 10.9% | 9.9%
Building 07 | 7.7% | 7.2%
Building 08 | 6.6% | 6.3%
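We observe improvements for every building on this test week, with relative WAPE reductions ranging from roughly 5% to 10%. This suggests that the shared adapter benefits the whole portfolio rather than just a few buildings.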
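To compare buildings on an equal footing, the absolute WAPE values can be converted into relative reductions. The snippet below uses the rounded figures from the table, so the percentages are approximate; the helper name is ours.

```python
# WAPE values (%) copied from the table above: (zero-shot, fine-tuned).
results = {
    "Building 01": (8.0, 7.4),
    "Building 02": (12.2, 11.3),
    "Building 03": (8.3, 7.5),
    "Building 04": (8.0, 7.6),
    "Building 05": (7.2, 6.8),
    "Building 06": (10.9, 9.9),
    "Building 07": (7.7, 7.2),
    "Building 08": (6.6, 6.3),
}


def relative_reduction(before, after):
    """Fractional reduction in error, e.g. 0.10 means 10% lower."""
    return (before - after) / before


reductions = {b: relative_reduction(z, f) for b, (z, f) in results.items()}
for building, reduction in reductions.items():
    print(f"{building}: {reduction:.1%}")

best = max(reductions, key=reductions.get)
worst = min(reductions, key=reductions.get)
print(f"largest gain: {best}, smallest gain: {worst}")
```

On these rounded numbers, Building 03 benefits the most in relative terms and Building 08 the least.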
4.3 Covariate-informed fine-tuning
Can we provide Chronos-2 with known covariates during fine-tuning?
Up to this point, Chronos-2 has only seen the target series itself—that is, historical total_load_kw values.
However, in our building-demand scenario, we do have access to—or can reasonably forecast—the key driving factors: outdoor temperature, occupancy patterns, solar irradiance, and weekend indicators. These are the covariates that influence changes in total_load_kw.
So, in this fine-tuning scenario, we want to explore whether Chronos-2 can be fine-tuned not just on the target’s history, but also on the relationship between the target and known-future covariates.
This requires changing the fine-tuning input format. Instead of passing only the target, each training task should now also include past_covariates and future_covariates:
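A small helper for building such a task is sketched below. It follows the same dictionary layout as the portfolio example later in this article; the helper name, the alignment check, and the random stand-in data are our own assumptions.

```python
import numpy as np

known_future_columns = [
    "outdoor_temp_c",
    "occupancy",
    "solar_irradiance",
    "is_weekend",
]


def make_covariate_task(target, covariates):
    """Build one fine-tuning task from a 1-D target and a dict of 1-D covariates."""
    target = np.asarray(target, dtype="float32")
    for name, values in covariates.items():
        if len(values) != target.shape[0]:
            raise ValueError(f"covariate {name!r} is not aligned with the target")
    return {
        "target": target[None, :],  # shape (1, n_timesteps)
        "past_covariates": {
            name: np.asarray(values, dtype="float32")
            for name, values in covariates.items()
        },
        # None marks each covariate as known in the future during training.
        "future_covariates": {name: None for name in covariates},
    }


# Random stand-in data covering the 83-day training period at hourly resolution.
n = 83 * 24
rng = np.random.default_rng(0)
covariates = {name: rng.normal(size=n) for name in known_future_columns}
task = make_covariate_task(rng.normal(size=n), covariates)
print(task["target"].shape, sorted(task["past_covariates"]))
```

The alignment check is worth keeping in real pipelines, since a covariate that is off by even one hour would silently weaken the relationship the adapter is supposed to learn.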
The past_covariates section holds the historical data for each covariate. During fine-tuning, Chronos-2 learns how changes in temperature, occupancy, solar irradiance, and weekend indicators affect the load.
The future_covariates section informs Chronos-2 that these same covariates will be available during the forecast period. Here, we set them to None because Chronos-2 internally generates future windows from the historical data. Later, during inference, we will supply the actual future covariate values via future_df, just as we did in Part 1.
The fine-tuning call itself remains nearly identical:
For Building 03, the covariate-informed zero-shot WAPE is 4.0%. After fine-tuning the covariate-informed adapter on Building 03, WAPE drops to 2.8%, achieving a 30.7% relative reduction.
This improvement is significantly larger than what target-only fine-tuning achieved.
There is also a valuable practical takeaway here: sometimes the biggest improvement does not come from fine-tuning alone, but from fine-tuning the model with the right information.
4.4 Portfolio + covariates
Can we combine both covariate and fleet information for fine-tuning?
The previous two scenarios introduced the “portfolio” and “covariate” components independently. Naturally, we want to leverage both together.
This setup is what I consider most relevant for many real-world use cases, since in practice we rarely deal with just a single asset, and we often have access to known or forecastable external signals that can improve target series forecasting. Using both for fine-tuning is not only logical but likely the preferred approach.
Specifically, for our current case, we fine-tune across all eight buildings, and for each building we provide total_load_kw as the target along with outdoor_temp_c, occupancy, solar_irradiance, and is_weekend as known-future covariates:
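# Covariates whose future values are known (or forecastable) at prediction time.
known_future_columns = ["outdoor_temp_c", "occupancy", "solar_irradiance", "is_weekend"]
train_inputs = []
for building, building_df in train_df.groupby("building", sort=True):
    # Keep each building's history in chronological order.
    building_df = building_df.sort_values("timestamp")
    train_inputs.append(
        {
            # Target with shape (1, n_timesteps).
            "target": building_df[["total_load_kw"]]
            .to_numpy(dtype="float32")
            .T,
            # Historical values of every covariate.
            "past_covariates": {
                column: building_df[column].to_numpy(dtype="float32")
                for column in known_future_columns
            },
            # None marks the covariate as available over the forecast horizon.
            "future_covariates": {
                column: None
                for column in known_future_columns
            },
        }
    )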
In the code above, we create one training task per building. The same approach applies to the validation data as well. Each building gets one validation task, and Chronos-2 uses the last 168 hours of each task as the validation forecast window.
The fine-tuning call itself still remains the same:
The figure below shows the fine-tuning results for Building 03, where the improvement from fine-tuning is clearly visible:
Figure 3. Portfolio + covariate fine-tuning compared with the plain zero-shot forecast for Building 03. (Image by author)
Across all eight buildings, the plain zero-shot baseline has a WAPE of 8.4%. After portfolio + covariate fine-tuning, WAPE drops to 2.8%, a 66.8% relative reduction.
4.5 Held-out transfer
Can we adapt once, then deploy on assets the model never saw during fine-tuning?
So far, every fine-tuning scenario has involved the same buildings that later appear at inference time.
There’s one more critical scenario to consider: what happens if a brand new building has only just been added to the network?
In this final test case, we deliberately exclude Building 06 from the fine-tuning process so that Chronos-2 has no exposure to its data while training the LoRA adapter. We train on the remaining seven buildings using both historical target values and known-future covariates. Once training is complete, we then apply the learned adapter to Building 06 for prediction.
The required code change is straightforward:
held_out_building = "Building 06"
# Every building except the held-out one takes part in fine-tuning.
train_buildings = [
    building
    for building in sorted(train_df["building"].unique())
    if building != held_out_building
]
train_inputs = []
for building in train_buildings:
    # Select one building and keep its history in chronological order.
    building_df = train_df[
        train_df["building"].eq(building)
    ].sort_values("timestamp")
    train_inputs.append(
        {
            "target": building_df[["total_load_kw"]]
            .to_numpy(dtype="float32")
            .T,
            "past_covariates": {
                column: building_df[column].to_numpy(dtype="float32")
                for column in known_future_columns
            },
            "future_covariates": {
                column: None
                for column in known_future_columns
            },
        }
    )
At prediction time, Building 06 becomes our forecast target:
For Building 06, the covariate-informed zero-shot model starts with a WAPE of 4.2%. After applying the adapter that was fine-tuned on the other seven buildings, the error drops to 3.1%. That translates to a 26.8% relative improvement.
When it comes to real-world production use, the held-out transfer setup we tested above is often the more practical and scalable approach. The idea is simple: fine-tune an adapter on a representative group of buildings, then roll it out to newly connected assets as they join the system. For each new asset, we still feed it recent operational context and known-future covariates, but there’s no need to retrain the adapter right away. In any case, a brand new building wouldn’t have enough historical data to make immediate fine-tuning worthwhile.
5. What did we learn?
Now that we’ve examined all five scenarios in detail, let’s lay the results out side by side.
For each scenario, we benchmark the fine-tuned model against its corresponding zero-shot baseline. Specifically, target-only fine-tuning is measured against target-only zero-shot prediction, and covariate-informed fine-tuning is measured against covariate-informed zero-shot prediction:
Figure 4. Fine-tuning delivers improvements across all five scenarios. The setups informed by covariates showed the most significant gains. (Image by author)
The trend is quite consistent. Fine-tuning on target data alone provides some benefit, but the gains are relatively modest. The real improvements emerge when known-future covariates are incorporated and the adapter is fine-tuned around them. The held-out transfer result is also promising: even for a building that was completely left out of the fine-tuning phase, the adapter is able to capture patterns from similar buildings and still outperform the covariate-informed zero-shot baseline.
You can find the full notebook here:
Reference
[1] Hu, E. J., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021.