"Crunching The Pitch: Can Algorithms Really Predict World Cup Glory?"

With the FIFA World Cup scheduled to begin on Thursday, June 11, 2026, opening at the Mexico City Stadium, I thought it would be exciting to build the best possible machine learning model to forecast match results. To achieve this, I have compiled several databases containing 49,000 matches, featuring data on Elo ratings, match outcomes, and tournament venues. The data spans everything from FIFA competitions to the Baltic Cup, with matches dating from 1872 to 2026, and we will use it to take a probabilistic approach to the beautiful game.

We will evaluate the performance of several machine learning models, including:

Multinomial regression
Multinomial ridge / elastic-net model
LightGBM

We will also analyze the strengths and weaknesses of our models to develop a well-calibrated model that correctly predicts home wins 86% of the time. By balancing model performance, calibration, and complexity, we will identify the most suitable model for our dataset.

Soccer by the Numbers

Distribution of total goals per match in the training dataset, revealing a strong concentration of matches with low goal totals and a long right tail of increasingly rare high-scoring games. Illustration by Author.

Many people claim that soccer is boring. As a devoted soccer fan, I disagree, but to be fair, there is some justification for this view. The majority of matches conclude with fewer than 5 goals, and anything exceeding 20 is an anomaly, if not practically impossible. In contrast, it is not unusual for a single player to score more than 50 points in an NBA game. Yet despite the slower pace, venues from pubs in England to botecos in Rio remain packed.

What critics fail to appreciate is that the low-scoring nature of the game can actually make it more thrilling, as it becomes much harder for teams to build a commanding lead, keeping fans on the edge of their seats until the final whistle. Unfortunately, this also means that matches end in a draw nearly 22% of the time—which can be equally frustrating. Nevertheless, the sport continues to thrive in popularity.

Annual count of international matches in the pre-2018 training dataset, illustrating the long-term growth of international football activity from sparse early records to consistently high match volumes following the late twentieth century. Illustration by Author.

The high frequency of draws actually presents a modeling challenge that we will address later, but before we get to that, let us review how we assembled this data.

Combining the Data

Often the most effective way to enhance a model is simply to acquire more data. We will be working with international_results.csv, international_team_ratings.csv, and international_goalscorers.csv.

We need to join international_results.csv with international_team_ratings.csv so we can incorporate Elo ratings. This might seem straightforward, but as you might expect, the team names do not align perfectly, so we must resort to text processing unless we want to manually verify 336 teams individually. We also need to exercise extreme caution regarding when the Elo rating was last updated. We could use the Elo from the same day the match takes place, but that would introduce data leakage, since Elo scores are only updated after the match concludes. Using it as a feature is tempting but problematic.

We must use the most recent Elo score available prior to the match, and as an additional engineered feature, we track the time elapsed since the latest Elo update, hypothesizing that more recent ratings would be more informative than older ones. The code for merging these tables and the entire project is available in the Appendix.
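As a minimal sketch of the leakage-safe join described above, pandas' `merge_asof` can pick the latest rating strictly before each match date. The ratings layout here (columns `team`, `date`, `rating`) is an assumption for illustration, and the real files may be organized differently. Setting `allow_exact_matches=False` is what excludes a rating stamped on the match day itself.

```python
import pandas as pd

def attach_pre_match_elo(matches: pd.DataFrame, ratings: pd.DataFrame, side: str) -> pd.DataFrame:
    """Attach the latest rating strictly before each match for one side ('home' or 'away')."""
    team_col = f"{side}_team"
    left = matches.sort_values("date")
    # Assumed ratings layout: one row per (team, date) with the rating after that date's update
    right = ratings.rename(
        columns={"team": team_col, "date": "rating_date", "rating": f"{side}_rating_pre_match"}
    ).sort_values("rating_date")
    merged = pd.merge_asof(
        left,
        right,
        left_on="date",
        right_on="rating_date",
        by=team_col,
        direction="backward",       # look only into the past
        allow_exact_matches=False,  # skip same-day ratings, which already include the result
    )
    # Staleness of the rating in days (NaN when the team has no earlier rating)
    merged[f"rating_age_days_{side}"] = (merged["date"] - merged["rating_date"]).dt.days
    return merged.drop(columns="rating_date")
```

Calling the function once for `"home"` and once for `"away"` yields both pre-match ratings, after which `rating_diff` is simply the home column minus the away column.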

Top tournaments by match count in the training dataset, highlighting the dominance of friendlies and FIFA World Cup qualification matches compared to all other international competitions. Illustration by Author.

international_results.csv

Field type	Examples
Match identity	`source_match_id`, `date`, `season`, `competition`
Teams	`home_team`, `away_team`
Final result	`home_score`, `away_score`, `match_result`, `result_class`
Context	`neutral`, `tournament`, `city`, `country`

international_team_ratings.csv

Feature	Meaning
`home_rating_pre_match`	Home team Elo before kickoff
`away_rating_pre_match`	Away team Elo before kickoff
`rating_diff`	Home Elo minus away Elo
`rating_age_days_home`	How stale the home team rating is
`rating_age_days_away`	How stale the away team rating is

international_goalscorers.csv

Feature idea	Meaning
Unique scorers in recent matches	Whether a team relies on one scorer or multiple players
Goals by top scorer	Concentration of scoring output
Recent scoring form	Attacking productivity before this match

Comparison of match-result class distributions across the training and test splits, showing broadly similar proportions of home wins, draws, and away wins in each dataset. Home victories occur most often, with away wins and draws trailing behind. Illustration by Author.

Since we’re working with a time-series forecast, it’s crucial that our data split maintains chronological order. Our model will be tested on every game from 2018 forward, totaling around 8,000 matches.

Effective split	Approximate date logic
model train	earlier portion of pre-2018 data
validation	most recent ~20% of the pre-2018 training set
test	2018 and later
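The split logic in the table can be sketched as follows. This is an illustration under assumptions rather than the project's actual code: it expects a datetime `date` column, and the function name and the `test_start` and `val_frac` parameters are hypothetical.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, test_start: str = "2018-01-01", val_frac: float = 0.20):
    """Split matches into train / validation / test without shuffling across time."""
    df = df.sort_values("date").reset_index(drop=True)
    is_test = df["date"] >= pd.Timestamp(test_start)
    pre, test = df[~is_test], df[is_test]
    # Validation is the most recent slice of the pre-2018 data
    n_val = int(round(len(pre) * val_frac))
    cut = len(pre) - n_val
    return pre.iloc[:cut], pre.iloc[cut:], test
```

Because the rows are sorted before slicing, every validation match is at least as recent as every training match, and every test match falls on or after the cutoff date.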

Engineered Features

Summary of engineered feature distributions used during model training, including prior match counts, recent draw rates, goal-difference metrics, goals-for and goals-against rates, and points-per-match indicators across home and away team histories. Illustration by Author.

Our goal is to shift from simple match-level indicators toward more detailed pre-match features reflecting: team strength, attacking and defensive quality, home/away performance gaps, matchup balance, goalkeeper strength, and historical performance trends.

1. Draw-modeling features

The most noticeable shortcoming of our baseline multinomial logistic regression model was its inability to reliably identify draws. Even though the model had the capacity to estimate draw probability — since we defined the target variable as match_result ∈ {Home win, Draw, Away win} — draws were never predicted as the most probable outcome. The confusion matrix makes this clear, with no distinct column for draws.

Row-normalized test confusion matrix for the top baseline model, showing actual versus predicted home wins, draws, and away wins. The model only predicts home and away results: home wins are captured most accurately, while draws are never recognized as a distinct category. Illustration by Author.

This weak draw prediction isn’t limited to one algorithm type. When we focus on high-confidence mistakes — instances where the model’s chosen class was incorrect but carried a predicted probability of at least 0.60 — the same trend emerges across all models: they exhibit systematic overconfidence in home wins. A significant number of actual draws were incorrectly assigned a strong home-win probability, indicating the models are better at gauging which team is stronger than at assessing uncertainty or estimating draw likelihood.
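The high-confidence error filter described above can be sketched like this, assuming a probability matrix whose columns follow the order home win, draw, away win. The function name and the class labels are illustrative and may not match the labels used in the dataset.

```python
import numpy as np
import pandas as pd

# Assumed column order of the probability matrix
CLASSES = ["Home win", "Draw", "Away win"]

def high_confidence_errors(proba, y_true, threshold: float = 0.60) -> pd.DataFrame:
    """Count wrong predictions whose top class probability is at least `threshold`."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    predicted = np.array(CLASSES)[proba.argmax(axis=1)]
    confidence = proba.max(axis=1)
    mask = (predicted != y_true) & (confidence >= threshold)
    errors = pd.DataFrame({"actual": y_true[mask], "predicted": predicted[mask]})
    # One row per (actual, predicted) pair with its count
    return errors.groupby(["actual", "predicted"]).size().reset_index(name="n")
```

Running this per model and comparing the resulting tables is one way to reproduce the kind of faceted comparison shown in the chart.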

Counts of high-confidence wrong predictions on the test set for three algorithm families (glmnet multinomial ridge, LightGBM, and multinomial), broken down by actual and predicted class. The most confident errors arise when real draws are predicted as home wins. Illustration by Author.

To tackle this “blindness” toward draws, we can design features like abs_rating_diff, home_draw_rate_last_5, and form_draw_rate_mean_last_5, plus binary context markers such as neutral, flag_is_world_cup, and flag_is_friendly, which signal whether the game is played on neutral ground, at the World Cup, or as a friendly.

Feature group	Meaning	Examples
Elo closeness	Reflects how evenly matched the two sides are. Smaller gaps in ratings tend to correlate with higher draw probability.	`abs_rating_diff`
Recent draw tendency	Tracks how frequently each team’s recent matches ended in draws.	`home_draw_rate_last_5`, `away_draw_rate_last_10`
Combined draw tendency	Captures whether both teams have shown a recent pattern of drawing.	`form_draw_rate_mean_last_5`, `form_draw_rate_mean_last_10`
Match context	Tournament and venue indicators that could influence how often draws occur.	`neutral`, `flag_is_world_cup`, `flag_is_friendly`
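A rough sketch of how the Elo closeness and draw tendency features in the table might be built is shown below. It assumes a match table with `date`, `home_team`, `away_team`, `match_result`, and the two pre-match rating columns, and it assumes draws are labeled `"Draw"`. The helper name is hypothetical. The important detail is `shift(1)`, which keeps the current match out of its own rolling window.

```python
import pandas as pd

def add_draw_features(matches: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    df = matches.sort_values("date").reset_index(drop=True)
    df["abs_rating_diff"] = (df["home_rating_pre_match"] - df["away_rating_pre_match"]).abs()

    # One row per team per match, so a team's history covers home and away games alike
    is_draw = (df["match_result"] == "Draw").astype(float)
    parts = [
        pd.DataFrame({"match_idx": df.index, "date": df["date"], "side": side,
                      "team": df[f"{side}_team"], "is_draw": is_draw})
        for side in ("home", "away")
    ]
    long = pd.concat(parts, ignore_index=True).sort_values(["date", "match_idx"])

    # shift(1) drops the current match, so the rate only uses earlier games
    long["draw_rate"] = long.groupby("team")["is_draw"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    wide = long.pivot(index="match_idx", columns="side", values="draw_rate")
    df[f"home_draw_rate_last_{window}"] = wide["home"]
    df[f"away_draw_rate_last_{window}"] = wide["away"]
    # Combined tendency; left as NaN unless both teams have a history
    df[f"form_draw_rate_mean_last_{window}"] = wide[["home", "away"]].mean(axis=1, skipna=False)
    return df
```

Teams with no earlier matches get missing values, which tree models such as LightGBM can handle natively but which a regression model would need imputed.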

Final LightGBM predicted probabilities grouped by outcome class. Illustration by Author.

With these added features, the model can now more effectively distinguish between home/away wins and draws, reflected in a 3.3% rise in correct draw identifications. However, this is still a modest improvement, considering roughly 20% of matches finish as draws. So the features provide some benefit, but only marginally. This hints that a separate model focused solely on draw prediction — with target variable match_result ∈ {Draw, Not Draw} — might be worth exploring, though for now we should continue constructing additional features.

¬D stands for “not Draw,” meaning our target variable becomes: match ends in a draw (1) or match does not end in a draw (0)
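For illustration, the binary target could be derived from the three-class label as sketched below. The `"Draw"` label and the function name are assumptions. The negative-to-positive ratio is included only because a draw-only model faces class imbalance, and such a ratio is a common starting point for class weighting.

```python
import pandas as pd

def make_draw_target(match_result: pd.Series):
    """Return the binary draw target (1 = draw, 0 = not draw) and the imbalance ratio."""
    is_draw = (match_result == "Draw").astype(int).rename("is_draw")
    n_pos = int(is_draw.sum())
    n_neg = int(len(is_draw) - n_pos)
    # Ratio of non-draws to draws; undefined when there are no draws at all
    ratio = n_neg / n_pos if n_pos else float("nan")
    return is_draw, ratio
```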

Test confusion matrix for the best LightGBM validation model, showing actual versus predicted home win, draw, and away win classes. Illustration by Author.

2. Elo features

The typical team carries an Elo rating slightly above 1500 — comparable to Saudi Arabia, Iceland, or Haiti on the 2026 FIFA scale. When we plot the distributions of home wins, draws, and away wins against rating differences, a clear pattern emerges: as the gap between teams shrinks, draws become significantly more likely. Our distributions also show a slight leftward skew, confirming a moderate home-field advantage as expected.

Relying only on pre-match Elo ratings would mean missing out on potential log loss improvements. To fully leverage the available data, we also include the rating difference and the age of each team’s rating:

Feature	Description
`home_rating_pre_match`	Home team’s Elo rating prior to kickoff.
`away_rating_pre_match`	Away team’s Elo rating prior to kickoff.
`rating_diff`	Difference between home and away team Elo ratings before kickoff. Positive values indicate a home-team advantage.
`rating_age_days_home`	Number of days since the home team’s Elo rating was last refreshed.
`rating_age_days_away`	Number of days since the away team’s Elo rating was last refreshed.
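Since log loss is the yardstick for these Elo features, here is a small sketch of the multiclass version. The class order and labels are assumptions, and the function is a plain NumPy illustration rather than the evaluation code used in the project.

```python
import numpy as np

# Assumed column order of the probability matrix
CLASSES = ["Home win", "Draw", "Away win"]

def multiclass_log_loss(proba, y_true, eps: float = 1e-15) -> float:
    """Mean negative log probability assigned to the outcome that actually happened."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize after clipping
    idx = np.array([CLASSES.index(label) for label in y_true])
    return float(-np.mean(np.log(proba[np.arange(len(idx)), idx])))
```

A model that always predicts one third for each outcome scores ln(3), roughly 1.0986, which gives a useful reference point: any feature set worth keeping should push the test log loss below that.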

Predicted probabilities of an away win, draw, and home win as a function of rating difference. Image by Author.

3. Rolling past-performance features

Some might argue that combining rolling past-performance metrics with Elo ratings is unwise, as both aim to measure team strength—potentially introducing redundancy or high correlation into the model.

While rolling performance metrics do reflect team strength, their primary purpose is to capture momentum. Winning streaks are a genuine phenomenon in sports. For instance, Spain—currently the top pick according to supercomputers—is partly favored due to their historic 31-match unbeaten run heading into FIFA 2026.

Feature group	Description	Examples
Recent points per match	Average points accumulated over each team’s last 5 or 10 matches.	`home_points_per_match_last_5`, `away_points_per_match_last_10`
Recent goal difference	Average of goals scored minus goals conceded in recent matches.	`home_goal_diff_per_match_last_5`, `away_goal_diff_per_match_last_10`
Recent draw rate	Proportion of recent matches that ended in a draw.	`home_draw_rate_last_5`, `away_draw_rate_last_10`
Home-away form differences	Gap between home and away teams on identical rolling metrics.	`form_points_diff_last_5`, `form_goal_diff_diff_last_10`
Prior match counts	Total number of past matches available for each team before the fixture.	`home_prior_matches`, `away_prior_matches`
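The rolling metrics in the table could be computed along these lines. The sketch assumes a team-level table with one row per team per match and columns `team`, `date`, `points` (3 for a win, 1 for a draw, 0 for a loss), and `goal_diff`. That layout and the function name are assumptions for illustration.

```python
import pandas as pd

def rolling_form(team_matches: pd.DataFrame, windows=(5, 10)) -> pd.DataFrame:
    """Add past-only rolling averages and a prior-match counter for each team."""
    out = team_matches.sort_values(["team", "date"]).reset_index(drop=True)
    # Number of matches the team had already played before this one
    out["prior_matches"] = out.groupby("team").cumcount()
    for w in windows:
        for col in ("points", "goal_diff"):
            # shift(1) excludes the current match from its own window
            out[f"{col}_per_match_last_{w}"] = out.groupby("team")[col].transform(
                lambda s, w=w: s.shift(1).rolling(w, min_periods=1).mean()
            )
    return out
```

The home and away versions of each feature then come from joining this table back to the match table twice, once on the home team and once on the away team.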

4. Attack and defense form features

Although our model attempts to gauge attacking and defensive strength through points, it falls short of supercomputer models. State-of-the-art methods often incorporate player-level data, which is crucial for accurately assessing team capabilities. Since we only have access to match-level data, our attack and defense features are derived from historical results: recent scoring rates, conceding rates, scoring-rate differences, and conceding-rate differences.

Feature group	Description	Examples
Recent scoring rate	Average goals scored per game over the last 5 or 10 matches.	`home_goals_for_per_match_last_5`, `away_goals_for_per_match_last_10`
Recent conceding rate	Average goals conceded per game over the last 5 or 10 matches.	`home_goals_against_per_match_last_5`, `away_goals_against_per_match_last_10`
Scoring-rate difference	Home team’s recent scoring rate minus that of the away team.	`form_goals_for_diff_last_5`, `form_goals_for_diff_last_10`
Conceding-rate difference	Home team’s recent conceding rate minus that of the away team. Lower values suggest a defensive edge for the home side.	`form_goals_against_diff_last_5`, `form_goals_against_diff_last_10`
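To make the difference features concrete, the sketch below reshapes the match table into team-level rows, computes past-only scoring and conceding rates, and subtracts the away value from the home value. It assumes the `date`, `home_team`, `away_team`, `home_score`, and `away_score` columns listed earlier. The function name is hypothetical.

```python
import pandas as pd

def scoring_form_diff(matches: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    df = matches.sort_values("date").reset_index(drop=True)
    # Team-level view: each match contributes one row for each side
    home = pd.DataFrame({"match_idx": df.index, "date": df["date"], "side": "home",
                         "team": df["home_team"],
                         "goals_for": df["home_score"], "goals_against": df["away_score"]})
    away = pd.DataFrame({"match_idx": df.index, "date": df["date"], "side": "away",
                         "team": df["away_team"],
                         "goals_for": df["away_score"], "goals_against": df["home_score"]})
    long = pd.concat([home, away], ignore_index=True).sort_values(["date", "match_idx"])

    for col in ("goals_for", "goals_against"):
        # Past-only rolling mean: shift(1) removes the current match
        long[f"{col}_rate"] = long.groupby("team")[col].transform(
            lambda s: s.shift(1).rolling(window, min_periods=1).mean()
        )

    wide = long.pivot(index="match_idx", columns="side",
                      values=["goals_for_rate", "goals_against_rate"])
    for col in ("goals_for", "goals_against"):
        df[f"form_{col}_diff_last_{window}"] = (
            wide[(f"{col}_rate", "home")] - wide[(f"{col}_rate", "away")]
        )
    return df
```

A negative `form_goals_against_diff_last_5` means the home side has been conceding less than its opponent, matching the reading given in the table.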

Heatmap showing correlations among numeric model features, including rating difference, pre-match ratings, rating age, and season variables. Image by Author.

Grid Search

Large hyperparameter grids risk overfitting during cross-validation, and grid search complexity grows multiplicatively. To manage this, parameters are explored on a logarithmic scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2)—except for parameters like alpha, which must lie between 0 and 1.

glmnet_alpha: adjusts the balance between ridge and lasso regression in elastic net. A value of 0 is pure ridge and a value of 1 is pure lasso.

multinomial_decay: increases the penalty on larger coefficients, which helps prevent overfitting, although too much decay can cause underfitting.

Grid search cost ≈ number of configurations tested × time to train one model
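The multiplicative growth is easy to see by enumerating a grid explicitly. In this sketch the parameter name `glmnet_lambda`, its values, and the timing figures are hypothetical, chosen only to show a log-scale axis next to the bounded alpha axis. The number of cross-validation folds is included as an extra multiplier, which is an assumption about how the search is run.

```python
from itertools import product

def build_grid(param_values: dict) -> list:
    """Expand a dict of candidate values into one dict per configuration."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]

def estimated_cost_seconds(grid: list, seconds_per_fit: float, n_folds: int = 1) -> float:
    """Configurations x folds x time per fit."""
    return len(grid) * n_folds * seconds_per_fit

grid = build_grid({
    "glmnet_alpha": [0.0, 0.25, 0.5, 0.75, 1.0],  # bounded between 0 and 1
    "glmnet_lambda": [1e-5, 1e-4, 1e-3, 1e-2],    # hypothetical, log-spaced
})
```

Here five alpha values and four penalty values already give 20 configurations, and each additional parameter multiplies that count again.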

How confident is our final model? Histogram of the maximum class probability from the final LightGBM model, split by correct and incorrect predictions. Illustration by Author.

Calibration of the baseline-adjusted models: accuracy compared to average confidence in each bin, with bin sizes and an ideal calibration reference line. Illustration by Author.
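A calibration chart of this kind can be built from two arrays: the model's maximum class probability for each match and whether the predicted class was correct. The sketch below uses equal-width bins, and the function names and the bin count are assumptions rather than the settings behind the chart.

```python
import numpy as np

def reliability_bins(confidence, correct, n_bins: int = 10) -> list:
    """Group predictions by confidence and compare mean confidence with accuracy."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1
    bin_idx = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append({
                "bin": b,
                "n": int(mask.sum()),
                "mean_confidence": float(confidence[mask].mean()),
                "accuracy": float(correct[mask].mean()),
            })
    return rows

def expected_calibration_error(rows: list) -> float:
    """Size-weighted average gap between accuracy and confidence."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] / total * abs(r["accuracy"] - r["mean_confidence"]) for r in rows)
```

A perfectly calibrated model would have accuracy equal to mean confidence in every bin, which is the diagonal reference line in the chart.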

Rating Gap Insights

Which Features Matter Most?

Final Thoughts

Appendix
