With the FIFA World Cup scheduled to begin on Thursday, June 11, 2026, opening at the Mexico City Stadium, I thought it would be exciting to build the best possible machine learning model to forecast match results. To achieve this, I have compiled several databases containing 49,000 matches, featuring data on Elo ratings, match outcomes, and tournament venues. Spanning from FIFA competitions to the Baltic Cup, with matches dating from 1872 to 2026, we will adopt a probabilistic approach to the beautiful game.
We will evaluate the performance of several machine learning models, including:
- Multinomial regression
- Multinomial ridge / elastic-net model
- LightGBM
We will also analyze the strengths and weaknesses of our models to develop a well-calibrated model that correctly predicts home wins 86% of the time. By balancing model performance, calibration, and complexity, we will identify the most suitable model for our dataset.
Soccer by the Numbers
Many people claim that soccer is boring. As a devoted soccer fan, I disagree, but to be fair, there is some justification for this view. The majority of matches conclude with fewer than 5 goals, and anything exceeding 20 is an anomaly, if not practically impossible. In contrast, it is not unusual for a single player to score more than 50 points in an NBA game. Yet despite the slower pace, pubs from England to botecos in Rio remain packed.
What critics fail to appreciate is that the low-scoring nature of the game can actually make it more thrilling, as it becomes much harder for teams to build a commanding lead, keeping fans on the edge of their seats until the final whistle. Unfortunately, this also means that matches end in a draw nearly 22% of the time—which can be equally frustrating. Nevertheless, the sport continues to thrive in popularity.

The high frequency of draws actually presents a modeling challenge that we will address later, but before we get to that, let us review how we assembled this data.
Combining the Data
Often the most effective way to enhance a model is simply to acquire more data. We will be working with international_results.csv, international_team_ratings.csv, and international_goalscorers.csv.
We need to join international_results.csv with international_team_ratings.csv so we can incorporate Elo ratings. This might seem straightforward, but as you might expect, the team names do not align perfectly, so we must resort to text processing unless we want to manually verify 336 teams individually. We also need to exercise extreme caution regarding when the Elo rating was last updated. We could use the Elo from the same day the match takes place, but that would introduce data leakage, since Elo scores are only updated after the match concludes. Using it as a feature is tempting but problematic.
We must use the most recent Elo score available prior to the match, and as an additional engineered feature, we track the time elapsed since the latest Elo update, hypothesizing that more recent ratings would be more informative than older ones. The code for merging these tables and the entire project is available in the Appendix.
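To make the two steps above concrete, here is a minimal sketch of how the name matching and the leakage-safe Elo lookup could work. It is not the project's actual merge code: the function names, the 0.85 similarity cutoff, and the sample spellings and Elo values are all assumptions made for illustration.

```python
import difflib
import unicodedata
from bisect import bisect_left
from datetime import date


def normalize_team_name(name):
    """Lowercase and strip accents/punctuation so near-identical spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def match_team_name(name, rating_names, cutoff=0.85):
    """Return the ratings-table spelling closest to `name`, or None for manual review."""
    by_key = {normalize_team_name(n): n for n in rating_names}
    key = normalize_team_name(name)
    if key in by_key:
        return by_key[key]
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


def pre_match_rating(history, match_date):
    """history: (rating_date, elo) pairs for one team, sorted by date.

    Returns (elo, age_in_days) taken from the latest rating dated strictly
    before match_date, so a same-day post-match update is never used.
    """
    dates = [d for d, _ in history]
    pos = bisect_left(dates, match_date)  # index of first rating on or after match day
    if pos == 0:
        return None, None  # no rating exists before this match
    rating_date, elo = history[pos - 1]
    return elo, (match_date - rating_date).days


# Made-up spellings and Elo values, purely for illustration.
rating_names = ["Cote d Ivoire", "Curacao", "Brazil"]
matched = match_team_name("Côte d'Ivoire", rating_names)
unmatched = match_team_name("Atlantis", rating_names)

brazil_history = [(date(2022, 11, 24), 2169.0), (date(2022, 11, 28), 2185.0)]
elo, age_days = pre_match_rating(brazil_history, date(2022, 11, 28))
```

Returning None for unmatched names keeps the uncertain cases visible, so only a handful of teams need a manual check instead of all 336.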

international_results.csv
| Field type | Examples |
|---|---|
| Match identity | source_match_id, date, season, competition |
| Teams | home_team, away_team |
| Final result | home_score, away_score, match_result, result_class |
| Context | neutral, tournament, city, country |
international_team_ratings.csv
| Feature | Meaning |
|---|---|
| home_rating_pre_match | Home team Elo before kickoff |
| away_rating_pre_match | Away team Elo before kickoff |
| rating_diff | Home Elo minus away Elo |
| rating_age_days_home | How stale the home team rating is |
| rating_age_days_away | How stale the away team rating is |
international_goalscorers.csv
| Feature idea | Meaning |
|---|---|
| Unique scorers in recent matches | Whether a team relies on one scorer or multiple players |
| Goals by top scorer | Concentration of scoring output |
| Recent scoring form | Attacking productivity before this match |


Home victories occur most often, with away wins and draws trailing behind. Illustration by Author.
Since we’re working with a time-series forecast, it’s crucial that our data split maintains chronological order. Our model will be tested on every game from 2018 forward, totaling around 8,000 matches.
| Effective split | Approximate date logic |
|---|---|
| model train | earlier portion of pre-2018 data |
| validation | most recent ~20% of the pre-2018 training set |
| test | 2018 and later |
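As a rough sketch of the split in the table above, the function below sorts matches by date, holds out everything from 2018 onward as the test set, and carves the most recent 20% of the remaining matches off as validation. The function name and the dictionary layout of a match are assumptions for illustration.

```python
from datetime import date


def chronological_split(matches, test_start=date(2018, 1, 1), validation_fraction=0.2):
    """Split matches into (train, validation, test) without shuffling across time."""
    ordered = sorted(matches, key=lambda m: m["date"])
    pre_test = [m for m in ordered if m["date"] < test_start]
    test = [m for m in ordered if m["date"] >= test_start]
    n_validation = round(len(pre_test) * validation_fraction)
    cut = len(pre_test) - n_validation
    # The validation set is the most recent slice of the pre-2018 data.
    return pre_test[:cut], pre_test[cut:], test


# Toy data: one match per year from 2008 to 2019.
toy_matches = [{"date": date(year, 6, 1)} for year in range(2008, 2020)]
train, validation, test = chronological_split(toy_matches)
```

Because nothing is shuffled, every validation match is later than every training match, and every test match is later than both.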
Engineered Features

Our goal is to shift from simple match-level indicators toward more detailed pre-match features reflecting: team strength, attacking and defensive quality, home/away performance gaps, matchup balance, goalkeeper strength, and historical performance trends.
1. Draw-modeling features
The most noticeable shortcoming of our baseline multinomial logistic regression model was its inability to reliably identify draws. Even though the model had the capacity to estimate draw probability — since we defined the target variable as match_result ∈ {Home win, Draw, Away win} — draws were never predicted as the most probable outcome. The confusion matrix makes this clear, with no distinct column for draws.

This weak draw prediction isn’t limited to one algorithm type. When we focus on high-confidence mistakes — instances where the model’s chosen class was incorrect but carried a predicted probability of at least 0.60 — the same trend emerges across all models: they exhibit systematic overconfidence in home wins. A significant number of actual draws were incorrectly assigned a strong home-win probability, indicating the models are better at gauging which team is stronger than at assessing uncertainty or estimating draw likelihood.
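The high-confidence mistake filter described above can be sketched in a few lines. The class order, the function name, and the sample probabilities are assumptions; only the 0.60 threshold comes from the analysis.

```python
CLASSES = ("Home win", "Draw", "Away win")


def high_confidence_mistakes(probabilities, actuals, threshold=0.60):
    """Return (row, predicted, actual, confidence) for confident but wrong predictions."""
    mistakes = []
    for row_id, (probs, actual) in enumerate(zip(probabilities, actuals)):
        best = max(range(len(CLASSES)), key=lambda i: probs[i])  # argmax class
        predicted, confidence = CLASSES[best], probs[best]
        if predicted != actual and confidence >= threshold:
            mistakes.append((row_id, predicted, actual, confidence))
    return mistakes


# Made-up predictions: each tuple is (P(home win), P(draw), P(away win)).
sample_probs = [
    (0.70, 0.20, 0.10),
    (0.45, 0.30, 0.25),
    (0.65, 0.22, 0.13),
    (0.15, 0.20, 0.65),
]
sample_actuals = ["Draw", "Draw", "Home win", "Home win"]
mistakes = high_confidence_mistakes(sample_probs, sample_actuals)
```

In this toy sample the second row is also wrong, but its top probability is only 0.45, so it does not count as a high-confidence mistake.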

To tackle this “blindness” toward draws, we can design features like abs_rating_diff, home_draw_rate_last_5, form_draw_rate_mean_last_5, and binary context markers such as neutral, flag_is_world_cup, and flag_is_friendly — signaling whether the game is played on neutral ground or during the World Cup.
| Feature group | Meaning | Examples |
|---|---|---|
| Elo closeness | Reflects how evenly matched the two sides are. Smaller gaps in ratings tend to correlate with higher draw probability. | abs_rating_diff |
| Recent draw tendency | Tracks how frequently each team’s recent matches ended in draws. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Combined draw tendency | Captures whether both teams have shown a recent pattern of drawing. | form_draw_rate_mean_last_5, form_draw_rate_mean_last_10 |
| Match context | Tournament and venue indicators that could influence how often draws occur. | neutral, flag_is_world_cup, flag_is_friendly |
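A minimal sketch of how the first three feature groups in the table could be computed for one fixture follows. Treating the combined draw tendency as the simple average of the two teams' draw rates is my assumption here, as are the function names and the sample inputs.

```python
def draw_rate(recent_results, window=5):
    """Share of draws among a team's last `window` results ('W', 'D', 'L', oldest first)."""
    last = recent_results[-window:]
    if not last:
        return None  # no history before this fixture
    return sum(r == "D" for r in last) / len(last)


def draw_features(home_elo, away_elo, home_results, away_results, window=5):
    """Build draw-oriented features from pre-match Elo and each team's prior results."""
    home_rate = draw_rate(home_results, window)
    away_rate = draw_rate(away_results, window)
    both_known = home_rate is not None and away_rate is not None
    return {
        "abs_rating_diff": abs(home_elo - away_elo),
        f"home_draw_rate_last_{window}": home_rate,
        f"away_draw_rate_last_{window}": away_rate,
        # Assumed definition: plain average of the two draw rates.
        f"form_draw_rate_mean_last_{window}": (home_rate + away_rate) / 2 if both_known else None,
    }


# Made-up ratings and result strings, purely for illustration.
features = draw_features(1820, 1795, list("WDDLWD"), list("LLDWW"))
```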

With these added features, the model can now more effectively distinguish between home/away wins and draws, reflected in a 3.3% rise in correct draw identifications. However, this is still a modest improvement, considering roughly 20% of matches finish as draws. So the features provide some benefit, but only marginally. This hints that a separate model focused solely on draw prediction — with target variable match_result ∈ {Draw, Not Draw} — might be worth exploring, though for now we should continue constructing additional features.
¬D stands for “not Draw,” meaning our target variable becomes: match ends in a draw (1) or match does not end in a draw (0)

2. Elo features
The typical team carries an Elo rating slightly above 1500 — comparable to Saudi Arabia, Iceland, or Haiti on the 2026 FIFA scale. When we plot the distributions of home wins, draws, and away wins against rating differences, a clear pattern emerges: as the gap between teams shrinks, draws become significantly more likely. Our distributions also show a slight leftward skew, confirming a moderate home-field advantage as expected.


Relying only on pre-match Elo ratings would mean missing out on potential LogLoss improvements. To fully leverage the available data, we also include the rating difference and the age of each rating, as summarized in the table below.
| Feature | Description |
|---|---|
| home_rating_pre_match | Home team’s Elo rating prior to kickoff. |
| away_rating_pre_match | Away team’s Elo rating prior to kickoff. |
| rating_diff | Difference between home and away team Elo ratings before kickoff. Positive values indicate a home-team advantage. |
| rating_age_days_home | Number of days since the home team’s Elo rating was last refreshed. |
| rating_age_days_away | Number of days since the away team’s Elo rating was last refreshed. |

3. Rolling past-performance features
Some might argue that combining rolling past-performance metrics with Elo ratings is unwise, as both aim to measure team strength—potentially introducing redundancy or high correlation into the model.
While rolling performance metrics do reflect team strength, their primary purpose is to capture momentum. Winning streaks are a genuine phenomenon in sports. For instance, Spain—currently the top pick according to supercomputers—is partly favored due to their historic 31-match unbeaten run heading into FIFA 2026.
| Feature group | Description | Examples |
|---|---|---|
| Recent points per match | Average points accumulated over each team’s last 5 or 10 matches. | home_points_per_match_last_5, away_points_per_match_last_10 |
| Recent goal difference | Average of goals scored minus goals conceded in recent matches. | home_goal_diff_per_match_last_5, away_goal_diff_per_match_last_10 |
| Recent draw rate | Proportion of recent matches that ended in a draw. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Home-away form differences | Gap between home and away teams on identical rolling metrics. | form_points_diff_last_5, form_goal_diff_diff_last_10 |
| Prior match counts | Total number of past matches available for each team before the fixture. | home_prior_matches, away_prior_matches |
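The sketch below shows one way to compute rolling form without leaking the current result: features for a fixture are recorded first, and only then is the match added to each team's history. The 3/1/0 points scheme, the function name, and the toy matches are assumptions, not details taken from the project code.

```python
from collections import defaultdict, deque


def add_rolling_form(matches, window=5):
    """matches: date-sorted dicts with home_team, away_team, home_score, away_score.

    Returns one feature dict per match, built only from earlier matches.
    """
    history = defaultdict(lambda: deque(maxlen=window))  # team -> recent (points, goal_diff)
    played = defaultdict(int)  # team -> total prior matches
    rows = []
    for m in matches:
        row = {}
        for side in ("home", "away"):
            team = m[f"{side}_team"]
            recent = history[team]
            n = len(recent)
            row[f"{side}_prior_matches"] = played[team]
            row[f"{side}_points_per_match_last_{window}"] = (
                sum(p for p, _ in recent) / n if n else None
            )
            row[f"{side}_goal_diff_per_match_last_{window}"] = (
                sum(g for _, g in recent) / n if n else None
            )
        rows.append(row)
        # Update histories only after the features for this fixture are stored.
        diff = m["home_score"] - m["away_score"]
        home_points = 3 if diff > 0 else 1 if diff == 0 else 0  # assumed 3/1/0 scheme
        away_points = 3 if diff < 0 else 1 if diff == 0 else 0
        history[m["home_team"]].append((home_points, diff))
        history[m["away_team"]].append((away_points, -diff))
        played[m["home_team"]] += 1
        played[m["away_team"]] += 1
    return rows


# Toy fixtures with made-up scores.
toy = [
    {"home_team": "A", "away_team": "B", "home_score": 2, "away_score": 0},
    {"home_team": "B", "away_team": "A", "home_score": 1, "away_score": 1},
    {"home_team": "A", "away_team": "C", "home_score": 0, "away_score": 1},
]
rows = add_rolling_form(toy)
```

Teams with no prior matches get None rather than zero, so the model (or an imputation step) can tell "no history" apart from "poor form".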
4. Attack and defense form features
Although our model attempts to gauge attacking and defensive strength through points, it falls short compared to super-computer models. State-of-the-art methods often incorporate player-level data, which is crucial for accurately assessing team capabilities. Since we only have access to match-level data, our attack and defense features are derived from historical results—including recent scoring rates, conceding rates, scoring-rate differences, and conceding-rate differences.
| Feature group | Description | Examples |
|---|---|---|
| Recent scoring rate | Average goals scored per game over the last 5 or 10 matches. | home_goals_for_per_match_last_5, away_goals_for_per_match_last_10 |
| Recent conceding rate | Average goals conceded per game over the last 5 or 10 matches. | home_goals_against_per_match_last_5, away_goals_against_per_match_last_10 |
| Scoring-rate difference | Home team’s recent scoring rate minus that of the away team. | form_goals_for_diff_last_5, form_goals_for_diff_last_10 |
| Conceding-rate difference | Home team’s recent conceding rate minus that of the away team. Lower values suggest a defensive edge for the home side. | form_goals_against_diff_last_5, form_goals_against_diff_last_10 |
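For the difference features in the table, a small sketch is enough to show the sign convention: each difference is the home value minus the away value. The helper names and the goal lists are hypothetical, and every list is assumed to contain only matches played before the fixture.

```python
def per_match_rate(values, window):
    """Average of the last `window` values, or None when there is no history."""
    last = values[-window:]
    return sum(last) / len(last) if last else None


def safe_diff(home_value, away_value):
    """Home minus away, or None if either side has no history."""
    if home_value is None or away_value is None:
        return None
    return home_value - away_value


def attack_defense_features(home_for, home_against, away_for, away_against, window=5):
    """Each argument lists a team's goals per past match, oldest first."""
    feats = {
        f"home_goals_for_per_match_last_{window}": per_match_rate(home_for, window),
        f"home_goals_against_per_match_last_{window}": per_match_rate(home_against, window),
        f"away_goals_for_per_match_last_{window}": per_match_rate(away_for, window),
        f"away_goals_against_per_match_last_{window}": per_match_rate(away_against, window),
    }
    feats[f"form_goals_for_diff_last_{window}"] = safe_diff(
        feats[f"home_goals_for_per_match_last_{window}"],
        feats[f"away_goals_for_per_match_last_{window}"],
    )
    feats[f"form_goals_against_diff_last_{window}"] = safe_diff(
        feats[f"home_goals_against_per_match_last_{window}"],
        feats[f"away_goals_against_per_match_last_{window}"],
    )
    return feats


# Made-up goal counts for illustration.
form = attack_defense_features(
    home_for=[2, 1, 3, 0, 2, 2],
    home_against=[0, 1, 1, 2, 0, 1],
    away_for=[1, 0, 1, 1, 2],
    away_against=[2, 2, 1, 3, 2],
)
```

In this toy case the home side scores more (positive scoring-rate difference) and concedes less (negative conceding-rate difference) than the away side.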

Grid Search
Large hyperparameter grids risk overfitting during cross-validation, and grid search complexity grows multiplicatively. To manage this, parameters are explored on a logarithmic scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2)—except for parameters like alpha, which must lie between 0 and 1.
glmnet_alpha: Adjusts the balance between ridge and lasso regression in elastic net: 0 equals pure ridge, 1 equals pure lasso.
multinomial_decay: Increases the penalty on larger coefficients, which helps prevent overfitting, but too much decay can cause underfitting.
Grid search cost ≈ number of configurations tested × time to train one model
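To see how quickly the cost grows, here is a small sketch that builds a log-scale grid and counts configurations. The helper names, the second "lambda" grid, and the two-second training time are hypothetical; only the alpha and decay values match the grids used in this project.

```python
from itertools import product


def log_grid(low_exp, high_exp):
    """Powers of ten from 1e<low_exp> to 1e<high_exp>, inclusive."""
    return [float(f"1e{e}") for e in range(low_exp, high_exp + 1)]


def grid_cost(grids, seconds_per_fit):
    """Return (number of configurations, estimated total training seconds)."""
    configs = list(product(*grids.values()))
    return len(configs), len(configs) * seconds_per_fit


decay_grid = [0.0] + log_grid(-5, -2)   # 0, 1e-5, 1e-4, 1e-3, 1e-2
alpha_grid = [i / 4 for i in range(5)]  # 0, 0.25, 0.5, 0.75, 1

# One tuned parameter: 5 fits.
one_param = grid_cost({"alpha": alpha_grid}, seconds_per_fit=2.0)
# Adding a hypothetical second grid multiplies the count: 5 x 4 = 20 fits.
two_params = grid_cost({"alpha": alpha_grid, "lambda": log_grid(-4, -1)}, seconds_per_fit=2.0)
```

Each extra parameter multiplies the number of fits rather than adding to it, which is why the grids here are kept short.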
| Model family | Grid/configurations shown | What was tuned |
|---|---|---|
| Baselines | majority_baseline, frequency_baseline, rating_diff_multinom | Mostly untuned; used as comparison baselines |
| glmnet | alpha = 0, 0.25, 0.5, 0.75, 1 | Elastic-net mixing parameter |
| multinom | decay = 0, 1e-5, 1e-4, 1e-3, 1e-2 | L2 weight decay / coefficient shrinkage |
| LightGBM | less_regular, deeper, more_regular, current_final, l2_regularized, shallower, l1_l2_regularized, compact_robust, faster_small, slower_small | Named bundles of tree-depth, learning-rate, boosting-round, and regularization settings |
Our adjusted models show solid real-world reliability on the test data. When grouped by confidence levels, the accuracy we see in practice closely matches what the models predict. If a model says it’s 60% sure, it’s right about 60% of the time. When it’s more certain, the hit rate climbs even higher. The small gap from a “perfect” calibration line tells us these probability scores are genuinely useful tools, not just for ranking outcomes, but for measuring actual likelihood.

Keep in mind: this overall calibration focuses only on the top prediction, the outcome with the highest score. It doesn’t verify whether home wins, draws, or away wins are predicted accurately on their own. A model can seem well-calibrated overall while still underperforming drastically in one category, especially draws. This graph confirms you can generally trust the final confidence score, but it doesn’t prove draw odds are equally accurate.

Looking at individual outcomes, we see that home and away win predictions are spot-on. Their accuracy tracks nearly perfectly with confidence across most bins. When the model gives a high score to a home or away win, that result usually follows through. Essentially, these figures function as real probabilities, not just abstract scores.

Draws stand out as the exception. While the model gauges them reasonably within its typical range, that range is very limited. Even in tight matches, the model rarely gives draws a high probability, usually keeping them between low-to-mid figures. Here’s the core issue: the model doesn’t overlook draws; it consistently treats them as risks rather than likely winners. These probabilities might help measure draw danger, but draws almost never become the model’s top pick. This explains why the model struggles so much with predicting draws correctly.

Rating Gap Insights
This analysis highlights why draws are inherently tough for the model to crack. Real-world draw rates peak when teams are evenly matched and drop sharply as the Elo gap widens. All three model versions recognize this trend: they lower draw probabilities for lopsided matchups. The issue isn’t direction but degree. In closely fought games, actual draw rates hit around 33%, yet models assign roughly 25%. They get the pattern right, acknowledging higher draw risks in tight matches, but don’t push high enough. So while the model senses draw danger, it rarely elevates draws to top-prediction status. This bridges the gap between decent calibration and weak recall: the probabilities trend correctly but fall short of triggering a draw prediction via argmax (choosing the highest-scored outcome).

Which Features Matter Most?
As suspected, rating difference dominates all other factors in predicting results. Whether a game is played on neutral ground comes in a distant second. Reviewing feature importance reveals which custom-built inputs actually delivered useful signals.

Final Thoughts
Now’s a perfect time to discuss data volume and model options. Usually, bigger, more complex datasets call for more advanced models. However, jumping from simple regression to LightGBM barely improved results here. A key warning: going even more complex likely won’t help much with this data. Predicting football isn’t about finding a secret algorithm. It’s about crafting clean features without data leaks, testing against clear baselines, and verifying if confidence scores hold up. For now, one thing stands out: we need significantly more data for better forecasts. Specifically, player-level insights, like knowing Neymar is benched, are crucial. Finer detail is also essential if we want to adjust predictions mid-match.

Appendix
The full project code is available on my GitHub.
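For readers who want a feel for the top-prediction calibration check without opening the repository, here is a minimal sketch. It bins matches by the confidence of the argmax class and compares average confidence with observed accuracy in each bin. The function name, the choice of five equal-width bins, and the sample probabilities are assumptions and may differ from the project code.

```python
def top_label_calibration(probabilities, actual_indices, n_bins=5):
    """Compare mean confidence with accuracy, grouped by top-prediction confidence."""
    bins = [{"count": 0, "confidence_sum": 0.0, "correct": 0} for _ in range(n_bins)]
    for probs, actual in zip(probabilities, actual_indices):
        top = max(range(len(probs)), key=lambda i: probs[i])  # argmax class
        confidence = probs[top]
        b = min(int(confidence * n_bins), n_bins - 1)  # clamp confidence == 1.0
        bins[b]["count"] += 1
        bins[b]["confidence_sum"] += confidence
        bins[b]["correct"] += int(top == actual)
    report = []
    for i, b in enumerate(bins):
        if b["count"]:  # skip empty bins
            report.append({
                "bin": (i / n_bins, (i + 1) / n_bins),
                "n": b["count"],
                "mean_confidence": b["confidence_sum"] / b["count"],
                "accuracy": b["correct"] / b["count"],
            })
    return report


# Made-up probabilities ordered as (home win, draw, away win); actuals are class indices.
sample_probs = [(0.70, 0.20, 0.10), (0.65, 0.25, 0.10), (0.45, 0.30, 0.25), (0.20, 0.30, 0.50)]
sample_actuals = [0, 1, 0, 2]
report = top_label_calibration(sample_probs, sample_actuals)
```

A well-calibrated model shows mean_confidence close to accuracy in every populated bin. Because only the top class is checked, a per-class version of the same binning is still needed to judge draw probabilities on their own.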