Written by: Ahsaas Bajaj and Benjamin S. Knight
Ridge, Lasso, or ElasticNet? We conducted 134,400 simulations grounded in real-world ML systems to find out. The answer hinges on your specific optimization goals and a straightforward diagnostic you can calculate before even training your model.
If you’ve ever built a linear model in scikit-learn, you’ve likely encountered this dilemma: should you choose RidgeCV, LassoCV, or ElasticNetCV? Perhaps you went with whatever a guide suggested, relied on a teammate’s recommendation, or simply tested all three and picked the one with the highest cross-validation score.
Our goal was to swap guesswork for solid, data-backed decisions.
We ran 134,400 simulations covering 960 distinct setups across a 7-part parameter landscape—adjusting factors like sample size, feature count, multicollinearity, signal-to-noise ratio, coefficient sparsity, and two additional variables. We evaluated four regularization approaches—Ridge, Lasso, ElasticNet, and Post-Lasso OLS—against three key goals:
- Prediction accuracy (measured by test RMSE)
- Variable selection (F1 score for identifying correct features)
- Coefficient estimation (L2 error compared to true coefficients)
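As a rough sketch of how these three metrics could be scored for a single simulation run (the function name and the support threshold are ours, not the paper's):

```python
import numpy as np

def evaluate_run(beta_true, beta_hat, y_test, y_pred, tol=1e-8):
    """Score one simulation run on the three objectives."""
    # Prediction accuracy: root-mean-squared error on held-out data.
    rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
    # Variable selection: F1 over the estimated vs. true support.
    true_support = np.abs(beta_true) > tol
    est_support = np.abs(beta_hat) > tol
    tp = np.sum(true_support & est_support)
    precision = tp / max(np.sum(est_support), 1)
    recall = tp / max(np.sum(true_support), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    # Coefficient estimation: L2 distance to the true coefficients.
    l2_error = float(np.linalg.norm(beta_hat - beta_true))
    return rmse, f1, l2_error
```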
Our simulation settings aren’t random—they’re based on eight live production ML systems at Instacart, covering areas like demand forecasting, conversion prediction, and inventory management. The tested environments mirror real-life conditions faced by machine learning engineers daily.
This article breaks down the key takeaways from our research into a clear decision-making guide for your next project. If you’re a Data Scientist or MLE picking a regularizer, this is meant for you.
Key Takeaways
Before diving deeper:
- For prediction, choice hardly matters. Ridge, Lasso, and ElasticNet show almost identical performance in median RMSE—differing by no more than 0.3%. No setting produced even a small meaningful effect on RMSE differences between them. This holds only when you have enough training data (n/p ≥ 78 samples per feature).
- For selecting variables, the choice critically matters—especially with multicollinearity. Under high condition numbers and weak signals, Lasso’s recall drops sharply to 0.18, while ElasticNet holds strong at 0.93.
- When there are plenty of samples relative to features (n/p ≥ 78), all methods perform similarly—so pick Ridge; it’s fastest.
- Avoid Post-Lasso OLS, whatever your goal. It consistently underperformed across every metric we tracked.
What We Tested and Why
Our simulation design varies seven core parameters simultaneously.
Each of the 4 regularization methods was tested against all 960 configurations, using 35 different random seeds—resulting in a total of 134,400 simulation runs. For every run, we recorded test RMSE, F1 score (capturing both precision and recall in identifying the true set of influential features), and L2 coefficient error.
To understand why differences between methods occur, we applied omega-squared (ω²) from one-way ANOVA—an effect size measure showing how much of the performance gap variation can be tied to each input parameter. This doesn’t just tell us which method wins, but why and when.
Here’s the practical implication: most factors influencing method differences are things you can assess before fitting your model. You already know n (samples) and p (features). You can compute the condition number κ using numpy.linalg.cond(X). And the one hidden factor—SNR—has a free proxy: the α value chosen by LassoCV. A large α suggests weak signal; a small α indicates strong signal. We’ll revisit this shortly.
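A minimal pre-training diagnostic along these lines might look as follows (the LassoCV settings are illustrative, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def cheap_diagnostics(X, y, cv=5, random_state=0):
    """Compute the pre-training quantities the decision guide relies on."""
    n, p = X.shape
    ratio = n / p                      # sample-to-feature ratio n/p
    kappa = np.linalg.cond(X)          # condition number of the design matrix
    # SNR proxy: the alpha LassoCV selects (high alpha suggests weak signal).
    alpha = LassoCV(cv=cv, random_state=random_state).fit(X, y).alpha_
    return ratio, kappa, alpha
```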
Finding 1: For Prediction, Just Use Ridge
This is the most actionable insight for the broadest audience.
Ridge, Lasso, and ElasticNet deliver nearly identical prediction results. Across all 33,600 simulations per method, the median test RMSE varied by only 0.3% at most. Our ω² analysis backs this up: no parameter reached even a small effect threshold (ω² ≥ 0.01) for RMSE differences among the three. All pairwise gaps were negligible (under 0.02).
If your sole focus is accuracy, the near-equivalence itself is the main takeaway. Sample size impacts performance far more than your choice of regularizer.

So why go with Ridge? Speed. Ridge uses a closed-form solution for each α candidate, making it dramatically quicker than the others (median runtime: 6 seconds for Ridge vs. 9 seconds for Lasso and 48 seconds for ElasticNet).
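In scikit-learn, that default amounts to very little code; the log-spaced alpha grid below is an illustrative choice, not the paper's:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_ridge_default(X, y):
    """RidgeCV scores each alpha candidate with an efficient closed-form
    leave-one-out solution (its default when cv=None), which is why it is
    so much faster than coordinate-descent-based LassoCV/ElasticNetCV."""
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    return model.fit(X, y)
```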

ElasticNet’s extra cost comes from its dual search over both α and the L1 ratio ρ. The 167–219× average slowdown we observed relates directly to our 8-value L1 ratio grid. Using a simpler 3-value grid would reduce that penalty proportionally. Worse still, when coefficients are spread out uniformly, Lasso may take over an hour to finish (visible in the bimodal distribution’s right tail). All that added effort yields just a 0.04% median RMSE gain over Ridge—practically meaningless.
Important Notes
At the smallest sample size tested (n = 100), ElasticNet can outperform Ridge by 5–15%—but only in narrow cases where SNR is high (~1.0). When SNR is low, Ridge actually has a slight edge. These are edge-case observations, not broad patterns.
Also worth noting: LassoLars wasn’t included in our study, but the LARS algorithm can compute the full Lasso path in one pass (O(np²)), possibly matching Ridge’s computational efficiency. That said, LARS is known to suffer from numerical instability.
In environments with extreme multicollinearity (κ > 10⁴)—typical of real-world machine learning feature sets—our most significant results apply directly; this is precisely the context where our conclusions carry the greatest weight.
Key takeaway for prediction tasks: Start with RidgeCV as your default. The size of your dataset has a much bigger impact on performance than the specific regularization method chosen. However, prediction accuracy isn’t the only goal that matters. When identifying important features or obtaining precise coefficient estimates is critical—especially in the presence of multicollinearity—the optimal approach shifts considerably.
Finding 2: ElasticNet Is the Reliable Default for Feature Selection
In this area, the choice of method truly makes a difference. Feature selection—the process of determining which variables genuinely influence the outcome—is the task most affected by the type of regularization used, and where poor choices lead to the most serious consequences.
What Creates the Performance Gaps
Based on our ANOVA analysis of F1 score differences between methods, sample size is by far the most influential factor. However, once you’re working with limited data (n/p < 78), the condition number and signal-to-noise ratio become the main drivers of performance differences.
Severe Multicollinearity (κ > ~10⁴): Steer Clear of Lasso
This stands as one of the most consistent findings throughout our research, and it has direct implications for applied ML work. Seven out of the eight models we examined fall into the high-κ category. If your features show even moderate correlation—which is nearly guaranteed in any curated feature set—this result is relevant to your situation.
Under high κ combined with weak signals:
- Lasso recall: 0.18 (it overlooks 82% of genuinely important features)
- ElasticNet recall: 0.93 (it identifies 93% of genuinely important features)
That translates to a fivefold recall improvement with ElasticNet. The underlying mechanism is well-documented. When features are strongly correlated, Lasso tends to randomly select one feature from each correlated group while eliminating the others. ElasticNet’s L2 regularization component—the so-called “grouping effect” first outlined by Zou and Hastie (2005)—preserves correlated features as a group.
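The grouping effect is easy to reproduce on toy data. In this hypothetical example—two near-duplicate features that are both truly relevant, with illustrative penalty settings—Lasso concentrates weight on one feature while ElasticNet keeps both with similar coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two near-duplicate features, both truly relevant (illustrative data).
rng = np.random.default_rng(42)
z = rng.normal(size=500)
X = np.column_stack([z, z + 0.01 * rng.normal(size=500)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=500)

# Lasso tends to pick one member of the correlated pair and zero the other;
# ElasticNet's L2 component spreads weight across the group.
lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
```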
Our simulations confirm this isn’t an edge case. The largest F1 gaps (ΔF1 between 0.50 and 0.75) are concentrated in the high-κ columns at n = 100 and n = 1,000. This represents the typical scenario in production environments.
Mild Multicollinearity (κ < ~10²): ElasticNet Remains the Safer Choice
One might assume Lasso would perform best when κ is low. In practice, it doesn’t—at least not consistently. Even with low κ, Lasso’s recall fluctuates significantly depending on the signal-to-noise ratio (discussed further below).

ElasticNet sustains recall at or above 0.91 across all SNR levels, even when κ is low. Lasso only becomes competitive when both the SNR is strong and the underlying model is truly sparse. Since SNR is rarely known beforehand, ElasticNet is the more dependable option.
The Unexpected Ridge Result
This was surprising: Ridge often achieves the best F1 scores with small sample sizes, despite never conducting explicit feature selection. Why? Ridge’s recall is perpetually 1.0 because it keeps every feature, and that flawless recall outweighs the precision gains of sparse methods when those methods’ recall drops under low SNR conditions.
However, this doesn’t constitute true variable selection. Ridge assigns a nonzero weight to every feature. If you require a genuinely sparse model, Ridge won’t deliver. Pairing Ridge with post-hoc permutation importance testing is a logical next step, though we didn’t explore that approach in this study.
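One way that follow-up could be sketched—again, an approach we did not evaluate in the study, shown here on synthetic data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical workflow: Ridge for a stable fit, then permutation
# importance to recover an explicit feature ranking afterward.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2 * X[:, 0] + 1 * X[:, 1] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```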
Feature Selection: Key Takeaways

Key takeaway for feature selection: ElasticNetCV should be your go-to default. Lasso is only justified when κ is low, SNR is strong, and you have domain-specific reasons to believe the true model is sparse.
Finding 3: For Accurate Coefficient Estimation, Let κ Guide Your Choice
When the objective is recovering precise coefficient values—for model interpretation or causal analysis—the condition number κ serves as the primary decision criterion. Ideally, we would base this decision on the distribution of the true β coefficients, but those are unobservable. By contrast, κ can be directly calculated from your data. When κ is high, ElasticNet outperforms alternatives regardless of sparsity. When κ is low, the best method hinges on whether the true model is sparse or dense. Sample size affects the scale of performance differences but not their direction.
High κ (> ~10⁴): Choose ElasticNet. It delivers 20–40% lower L2 coefficient error than Lasso and maintains a steady advantage over Ridge irrespective of sparsity.
Low κ (< ~10²): Let your domain expertise about sparsity guide the decision.
- Sparse settings (genomics, text classification, sensor networks): Lasso or ElasticNet
- Dense settings (engineered feature sets, demand forecasting, conversion modeling): Ridge

Across all scenarios: Skip Post-Lasso OLS. It produces higher coefficient L2 error than standard Lasso throughout the entire simulation space. The unpenalized OLS refitting stage magnifies errors from the initial selection step. This is precisely the situation where a two-stage approach would seem beneficial, yet it fails to deliver.

Key takeaway for coefficient estimation: Use ElasticNet when κ is high; at low κ, base your choice on domain knowledge about sparsity. In every case, skip Post-Lasso OLS.
A Practitioner’s Decision Guide
The findings above boil down to a practical decision framework based entirely on quantities you can calculate before training any model: the sample-to-feature ratio n/p, the condition number κ (computed via numpy.linalg.cond(X)), and when more precision is needed, the regularization parameter α selected by a quick LassoCV run as a stand-in for the underlying SNR.
The complete flowchart appears in our paper (Figure 7). Below, we walk through the logic step by step as a decision tree.
The under-determined regime
When the number of features exceeds the number of samples, you’re in the under-determined regime. In this setting, Lasso’s α often hits the top of the search grid, and its recall drops sharply. Default to Ridge or ElasticNet regardless of your goal, and proceed carefully.
The large-sample regime
When n/p ≥ 78, you’re in the large-sample regime where all methods converge. Differences in performance across prediction, variable selection, and coefficient estimation all but disappear.
Use RidgeCV. It’s the fastest approach by a wide margin, with no sacrifice in accuracy. If you specifically need a sparse model for interpretability, ElasticNetCV or LassoCV work just fine at this ratio. The choice among them is essentially a wash.
The regime where choice matters
Below n/p = 78 is where your choice of method has the biggest impact. The right regularizer depends on your primary objective.
If prediction is your priority: Use RidgeCV. The RMSE differences among the three core methods are too small to justify the added complexity or computation. One narrow exception: at n ≈ 100 with high SNR (~1.0), ElasticNet delivers a measurable 5–15% improvement regardless of κ; at n ≈ 100 with very low SNR, Ridge has a slight edge. In either case, the gain is modest compared to what you’d get from simply increasing your sample size.
If variable selection is your priority: Split on the condition number.
- κ > ~10⁴ (high multicollinearity): Use ElasticNetCV. This is one of the clearest recommendations in the study. One nuance: at moderate-to-high SNR (or n ≥ 1,000), ElasticNet is the obvious choice, with F1 advantages over Lasso reaching ΔF1 of +0.75. At very low SNR with n ≈ 100 (identified by a saturated CV-selected α), Ridge achieves the highest F1, but only through perfect recall (keeping all features), not genuine variable selection. If you need an explicitly sparse model even in this corner, ElasticNet remains the least-bad option and still vastly outperforms Lasso.
- κ < ~10² (well-conditioned): A word of caution first: don’t default to Lasso just because κ is low. Lasso’s recall drops sharply at lower SNR levels regardless of multicollinearity, while ElasticNet maintains recall ≥ 0.91 across all SNR levels. ElasticNet is the safe default here. To refine further, run a quick LassoCV and check the selected α. If α is high or hits the grid boundary, you’re in a low-SNR regime. Ridge gives the best F1 (though not through true sparsification). If α is moderate, stick with ElasticNet. If α is low and domain knowledge supports sparsity, Lasso becomes a reasonable choice.
If coefficient estimation is your priority: Split on the condition number.
- κ > ~10⁴: ElasticNetCV dominates regardless of sparsity.
- κ < ~10²: Lean on domain knowledge. Sparse model → Lasso. Dense model → Ridge.
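The branches above can be condensed into a small helper. The thresholds come from the article; the function name, the return labels, and the low-SNR α cutoff are our own illustrative choices:

```python
def recommend_regularizer(n, p, kappa, objective, snr_alpha=None, sparse_prior=None):
    """Route through the decision guide.
    objective: 'prediction' | 'selection' | 'coefficients'
    snr_alpha: LassoCV-selected alpha (SNR proxy); sparse_prior: domain belief.
    """
    if p > n:
        return "ridge_or_elasticnet"      # under-determined regime: tread carefully
    if n / p >= 78:
        return "ridge"                    # large-sample regime: methods converge
    if objective == "prediction":
        return "ridge"                    # RMSE gaps too small to matter
    if kappa > 1e4:
        return "elasticnet"               # severe multicollinearity: avoid Lasso
    if objective == "selection":
        # Illustrative cutoff: a large CV-selected alpha signals low SNR,
        # where Ridge posts the best F1 (via perfect recall, not sparsity).
        if snr_alpha is not None and snr_alpha > 1.0:
            return "ridge"
        return "elasticnet"               # safe default even at low kappa
    # Coefficient estimation at low kappa: lean on domain knowledge.
    return "lasso" if sparse_prior else "ridge"
```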
The α Diagnostic: A Free SNR Proxy
The one hidden parameter that matters for fine-grained decisions—signal-to-noise ratio—can be estimated at no extra cost. When scikit-learn’s LassoCV fits your data, it returns the selected α. This value is inversely related to the underlying SNR: high α means weak signal, low α means strong signal.
Our simulations provide direct empirical confirmation: the highest selected α values (approaching 10⁴–10⁵) occur exclusively in small-n, low-SNR configurations.
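A quick way to run this check—the function name and the 5% "near the boundary" tolerance are our own heuristics, not part of the study:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def alpha_at_grid_boundary(X, y, n_alphas=100, rtol=0.05):
    """Fit LassoCV and flag whether the selected alpha sits at or near the
    top of its search grid -- a heuristic signal for a low-SNR regime."""
    model = LassoCV(n_alphas=n_alphas, random_state=0).fit(X, y)
    alpha_max = model.alphas_.max()
    saturated = bool(model.alpha_ >= (1 - rtol) * alpha_max)
    return model.alpha_, saturated
```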

These thresholds are rough heuristics drawn from our simulation grid—they’ll shift with feature scaling and dataset characteristics. Treat them as rules of thumb, not hard boundaries.
In All Uncertain Cases
When you’re unsure about SNR, unsure about sparsity, or working in the intermediate-κ range we didn’t directly test: ElasticNet is the default that won’t steer you wrong, and Post-Lasso OLS should be avoided.
The Meta-Finding: Sample Size Trumps Everything
One insight outweighs all method-level advice: increasing your sample-to-feature ratio helps every objective more than any choice of regularizer.
Sample size is the dominant driver of performance differences across all three metrics (ω² = 0.308 for F1, a large effect). The n × SNR interaction is the strongest two-way interaction across all comparisons (F = 569, p < 0.001). Signal-to-noise matters most precisely when samples are scarce. And at n/p ≥ 78, method choice becomes irrelevant entirely.
If you’re spending days fine-tuning your regularizer when you could be expanding your training set, you’re focusing on the wrong lever.
Quick Reference

Putting It Into Practice
The simulation framework is a reusable harness. We capped sample sizes at 100k observations for computational reasons, but the grid still spans the n/p inflection point where regularizer performance shifts. We’re now extending it to newer regularizers (Adaptive Lasso, SCAD, MCP) and intermediate κ levels.
To apply this framework to your next project, compute three quantities before fitting anything: the sample-to-feature ratio (n/p), the condition number (κ), and if you’re in the small-n regime, a quick LassoCV α as your SNR proxy. Route through the decision guide above based on your primary objective.
If n/p ≥ 78, use Ridge and spend your tuning budget elsewhere. If n/p < 78 and κ is high, use ElasticNet and don’t second-guess it. The only scenario where the choice requires real deliberation is low κ with small n, and even there, ElasticNet is never a bad answer.
The full paper, including all appendix figures, ANOVA tables, and the consolidated decision flowchart, is available on ArXiv.
Ahsaas Bajaj is a Machine Learning Tech Lead at Instacart. Benjamin S. Knight is a Staff Data Scientist at Instacart.
All images were created by the authors.



