Think of machine learning like a high-stakes competition where building the best ensemble—combining multiple models and techniques—can be worth millions of dollars. Even tiny improvements in performance metrics can translate into massive financial gains for teams striving to lead the field. Success demands not only flawless individual components but also perfection in how those components are integrated.
The state of the art
For years, gradient boosted models have dominated tabular and time series prediction tasks. These are ensemble methods because they blend outputs from several simpler models to produce a final result that outperforms any single one alone. However, the landscape is shifting. Pre-trained models like TabPFN for tabular data and Chronos for time series are now matching or even surpassing gradient boosted models on key benchmarks. In a sense, these newer models also function as ensembles—but instead of combining many predictions, they’re effectively an amalgamation of the diverse data they were trained on. This underlying principle is broadly influential and opens doors to further innovation.
Today, these two fundamentally different approaches, gradient boosting and pre-trained models, are competing at the top of machine learning leaderboards, with numerous other architectures close behind, each carrying distinct strengths and weaknesses. Because these models learn differently and often from varied data sources, their errors tend to differ, so combining them into a single hybrid ensemble can preserve most advantages while canceling out many limitations. When executed well, this strategy usually delivers stronger performance and more reliable models.
Assertions and assumptions
The same logic used to identify which data matters most for accurate predictions can also help determine which models contribute most to those predictions. Just as gradient boosted models outperform individual estimators by fusing their outputs, a diverse mix of models typically beats any lone approach.
For this discussion, we’ll assume all necessary data is correctly included in the modeling process—that is, every relevant piece of information is present at prediction time (inference). In real-world data science, meeting this assumption is far from straightforward; ignoring it can undermine everything discussed here. In fact, much of data science centers on fulfilling this requirement with properly formatted data. Keep in mind that input features aren’t fixed: different model architectures thrive on different data types, and some may struggle or fail entirely with certain formats—a critical consideration as hybrid numeric/language models remain in early development.
Multi-Layer Stacking
A flexible framework adaptable to time series, tabular regression, or classification problems
Layer 1
There are many ways to build ensembles, and organizing the process into layers makes sense. Layer 1 consists of your core set of base models (e.g., CatBoost, MLPs, TabPFN, etc.).
For tabular tasks, you can use bootstrap aggregation (bagging): generate new training sets by sampling with replacement from the original data, train a separate model on each, and average their predictions. You could also tune hyperparameters for each model, but this gets computationally expensive quickly, since every candidate configuration has to be retrained on every bootstrap sample (“bag”). To speed things up, use a hyperparameter optimizer like Optuna, which can prune unpromising trials early and concentrate the search on good settings. Alternatively, apply proven hyperparameter configurations tailored to each model and dataset type. These tuned variants can either be averaged to represent one “meta-version” of the model or kept as distinct versions for use in the next layer.
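The bagging procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production recipe: the one-feature least-squares line stands in for a real base model such as CatBoost, and the names `fit_line`, `bagged_fit`, and `bagged_predict` are hypothetical helpers invented for this example.

```python
import random
from statistics import mean

def fit_line(points):
    """Least-squares fit of y = a + b*x on (x, y) pairs; a stand-in base model."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # degenerate bag: every sampled x is identical
        return y_bar, 0.0
    b = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return y_bar - b * x_bar, b

def bagged_fit(points, n_bags=25, seed=0):
    """Train one model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        bag = rng.choices(points, k=len(points))  # same size as the original data
        models.append(fit_line(bag))
    return models

def bagged_predict(models, x):
    """Average the predictions of all bagged models."""
    return mean(a + b * x for a, b in models)

# Toy data (assumed for illustration): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(x, 2 * x + 1 + noise.gauss(0, 0.5)) for x in range(20)]

models = bagged_fit(data)
print(len(models), round(bagged_predict(models, 10), 2))
```

Each bag sees a slightly different version of the data, so averaging the resulting models reduces the variance of the final prediction.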
Time series forecasting complicates traditional bootstrapping because temporal order must be preserved—you can’t randomly shuffle timestamps. Instead, use rolling-window cross-validation: train a model to predict a validation window that comes strictly after its training period. After evaluation, fold that validation window into the training data and repeat for the next time slice. This gives a realistic view of how the model performs over time, though typically only the final model (trained on the latest data) is used for actual predictions. Still, earlier out-of-fold predictions can feed into the next layer.
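As a rough sketch of the rolling-window scheme described above, the generator below yields training and validation index ranges in which each validation window comes strictly after its training period and is then folded into the next training set. The function name `expanding_window_splits` and the naive "last value" forecaster are assumptions made for illustration; any real forecaster could take the forecaster's place.

```python
def expanding_window_splits(n_samples, n_windows, window_size):
    """Yield (train_indices, val_indices); validation always follows training."""
    first_val_start = n_samples - n_windows * window_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested windows")
    for k in range(n_windows):
        val_start = first_val_start + k * window_size
        # Training data grows each round: earlier validation windows are folded in.
        yield range(0, val_start), range(val_start, val_start + window_size)

splits = list(expanding_window_splits(n_samples=12, n_windows=3, window_size=2))

# Collect out-of-fold predictions from a naive stand-in model that
# repeats the last value seen in training.
series = [float(t) for t in range(12)]
oof = {}
for train, val in splits:
    last_value = series[train[-1]]
    for t in val:
        oof[t] = last_value

for train, val in splits:
    print(list(train), list(val))
```

Only timestamps 6 through 11 receive out-of-fold predictions here; the first six points are only ever used for training, which matters when those predictions feed the next layer.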
Layer 2
Once the base models are trained, you have both training and validation performance metrics. Crucially, the test set should remain untouched during all intermediate steps. With known performance and initial predictions in hand, Layer 2 introduces smarter combination strategies.
For tabular problems, train a second round of bagged models using Layer 1 predictions as additional input features. If a base model underfits or overfits on validation, you can exclude it at this stage.
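One way to build those additional features is sketched below. It assumes the Layer 1 predictions are produced out-of-fold (each row is predicted by a model that never saw it), a common precaution against leakage that the text above does not spell out. The helpers `fit_mean`, `fit_nearest`, `out_of_fold_predictions`, and `augment` are hypothetical, and the two toy base models stand in for real ones.

```python
from statistics import mean

def fit_mean(X, y):
    """Toy base model: always predicts the training mean."""
    m = mean(y)
    return lambda row: m

def fit_nearest(X, y):
    """Toy base model: 1-nearest-neighbour on squared Euclidean distance."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for x in X]
        return y[dists.index(min(dists))]
    return predict

def out_of_fold_predictions(fit, X, y, n_folds=3):
    """Predict every row with a model trained on the other folds only."""
    oof = [None] * len(X)
    for fold in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        kept = [i for i in range(len(X)) if i % n_folds != fold]
        model = fit([X[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            oof[i] = model(X[i])
    return oof

def augment(X, y, base_fits, n_folds=3):
    """Append one out-of-fold prediction column per base model to X."""
    columns = [out_of_fold_predictions(fit, X, y, n_folds) for fit in base_fits]
    return [list(row) + [col[i] for col in columns] for i, row in enumerate(X)]

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
X_layer2 = augment(X, y, [fit_mean, fit_nearest])
print(X_layer2[0])
```

The Layer 2 models would then be trained on `X_layer2` instead of `X`; excluding a weak base model amounts to leaving its fit function out of the list.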
In time series, you can’t simply add Layer 1 predictions as features across the entire training set, because the earliest portion of the data has no out-of-fold predictions: no model was trained on data that came before it. One workaround: if your Layer 2 model handles missing values, or if you restrict training to the subset of rows that have predictions, you can retrain using both the original data and Layer 1 outputs. While feasible, more refined alternatives exist.
Using the validation metrics and out-of-fold predictions from Layer 1, you can combine base model outputs in several ways:
- Simple average of all predictions
- Validation-weighted average (better-performing models get more influence)
- Ordinary least squares regression to find optimal linear weights minimizing prediction error
- Greedy ensemble: start with the top model, then incrementally add others until performance plateaus
- Train an entirely new model solely on the combined predictions as inputs
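Three of the strategies in this list can be sketched with the standard library alone. The code below is an illustrative assumption rather than a reference implementation: it uses mean absolute error as the metric, inverse-error weights for the validation-weighted average, and forward selection with equal-weight averaging for the greedy ensemble. The OLS and learned meta-model variants are left out for brevity, and all names and toy numbers are made up for the example.

```python
from statistics import mean

def mae(pred, truth):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(p - t) for p, t in zip(pred, truth))

def average(preds):
    """Simple average: element-wise mean across prediction vectors."""
    return [mean(col) for col in zip(*preds)]

def weighted_average(preds, weights):
    """Weighted element-wise mean; weights need not sum to one."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, col)) / total for col in zip(*preds)]

def inverse_error_weights(preds, truth, eps=1e-9):
    """Better validation error gives a larger weight (one possible scheme)."""
    return [1.0 / (mae(p, truth) + eps) for p in preds]

def greedy_ensemble(val_preds, truth):
    """Start from the best model, add whichever model helps most, stop at a plateau."""
    remaining = dict(val_preds)
    first = min(remaining, key=lambda n: mae(remaining[n], truth))
    chosen = [first]
    best = mae(remaining.pop(first), truth)
    while remaining:
        def score(name):
            members = [val_preds[n] for n in chosen] + [val_preds[name]]
            return mae(average(members), truth)
        candidate = min(remaining, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:  # no further improvement
            break
        chosen.append(candidate)
        best = candidate_score
        del remaining[candidate]
    return chosen, best

# Toy validation data: "a" is biased high, "b" biased low, "c" is noisy.
truth = [10.0, 12.0, 14.0, 16.0]
val_preds = {
    "a": [11.0, 13.0, 15.0, 17.0],
    "b": [9.5, 11.5, 13.5, 15.5],
    "c": [14.0, 8.0, 18.0, 12.0],
}
all_preds = list(val_preds.values())
simple = average(all_preds)
weighted = weighted_average(all_preds, inverse_error_weights(all_preds, truth))
chosen, greedy_score = greedy_ensemble(val_preds, truth)
print(mae(simple, truth), mae(weighted, truth), chosen, greedy_score)
```

In this toy setup the greedy ensemble keeps "b" and "a", whose opposite biases partly cancel, and skips the noisy model "c". Scores computed on the same data used to choose the weights are optimistic, so the chosen combination still needs a held-out check.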
Keep in mind that Layer 1’s validation windows become Layer 2’s training data. That means only the very last validation window from Layer 1 carries over as Layer 2’s validation set. Rather than agonizing over picking a single best method, Layer 2 should try all available strategies, since these operations are computationally lightweight.
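To make the data flow concrete, here is a small sketch under assumed toy numbers: three Layer 1 validation windows, each holding predictions from two hypothetical base models "a" and "b" plus the true values "y". The earlier windows train a blend weight, found here by a simple grid search rather than any particular library routine, and the last window is held back to validate it.

```python
def blend(w, a, b):
    """Convex combination of two prediction vectors."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def sse(pred, truth):
    """Sum of squared errors."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

def fit_weight(train_windows, steps=10):
    """Grid-search the weight on model "a" that minimizes total squared error."""
    candidates = [i / steps for i in range(steps + 1)]
    def loss(w):
        return sum(sse(blend(w, win["a"], win["b"]), win["y"]) for win in train_windows)
    return min(candidates, key=loss)

# Layer 1 validation windows, oldest first (toy values).
windows = [
    {"a": [11.0, 13.0], "b": [9.0, 11.0], "y": [10.0, 12.0]},
    {"a": [15.0, 17.0], "b": [13.0, 15.0], "y": [14.0, 16.0]},
    {"a": [19.0, 21.0], "b": [17.0, 19.0], "y": [18.0, 20.0]},
]

# All but the last window train Layer 2; the last window validates it.
train_windows, val_window = windows[:-1], windows[-1]
w = fit_weight(train_windows)
val_error = sse(blend(w, val_window["a"], val_window["b"]), val_window["y"])
print(w, val_error)
```

Because the weight is chosen without ever seeing the last window, the error on that window is an honest estimate of how the blend behaves on newer data.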
Layer 3
Time to add yet another layer. The tabular technique produced predictions from an additional set of bagged models, while the time series approach generated outputs from various ensembling methods. Layer 3 takes one of the ensembling techniques listed under Layer 2 and applies it to those outputs to build the final meta-model. This is the model you’d use for test set evaluation, though it’s wise to confirm it genuinely beats the base models.
The top-level model usually comes out on top, and it’s more robust against poor predictions from any one underlying model, since those weak signals can be assigned lower weight and typically get smoothed out through averaging. On the flip side, when one model detects a pattern that the rest miss, the multi-layer structure can learn to boost those valuable predictions. The approach tends to fall flat in only two scenarios: when a single model consistently outperforms all others across the board, which is uncommon, or when one or more base models perform particularly poorly, in which case they should be dropped altogether.
Was the effort justified?
In most cases, yes. The trade-off is that you end up training a large collection of models rather than just a single one. For sufficiently large datasets, both training time and inference speed can become real bottlenecks in some use cases. On the bright side, this workflow is naturally suited to parallelization, because models within a layer can be trained independently of one another, and if deep learning isn’t strictly necessary, faster algorithms can take its place. LightGBM, for instance, often trains roughly ten times faster than a deep learning model, as a rough rule of thumb, and often remains competitive in terms of performance.
This “ensemble of ensembles” philosophy has been championed and fully embraced by AutoGluon, an open-source AutoML library that uses multi-layer stacking as its go-to strategy. Its team has made significant contributions to both open-source tools and cutting-edge research in the discipline. Since the pre-training frontier for tabular and time series transformers is still being charted, the introduction of new, more diverse models will likely continue to reinforce this approach.
There’s strong reason to expect this philosophy will keep delivering results, just as it has across many other domains:
- Democracy functions as an ensemble of representatives, who in turn each represent an ensemble of their constituents (at least in principle). It’s imperfect, but still the most effective governance system we’ve devised so far.
- Medical diagnoses become more accurate with multiple evaluations. Weaving together interpretations from several radiologists, pathologists, or specialists reliably lowers the rate of misdiagnosis. Each clinician may spot unique patterns or rare edge cases, and their collective assessment proves more dependable than any single opinion.
- Even stock markets act as an ensemble of predictions about the future. While market movements historically haven’t mattered to most people directly, prediction markets and forecasting platforms are shifting that reality.
- In Claude Code’s February 2026 release, Anthropic rolled out collaborative “agent teams” where several Claude instances tackle tasks together, coordinating through shared task lists and direct peer communication. xAI takes a parallel multi-agent approach with Grok 4 Heavy/Grok 4.20, where autonomous agents run independently and “cross-check” each other’s outputs before arriving at a consensus answer.
It seems teamwork really is the winning formula. Stacking ensembles of ensembles appears again and again in the most successful systems humanity has engineered—and machine learning is no different. In the era of artificial intelligence, expanding this strategy won’t be optional—it’ll be essential.



