All code referenced in this section can be found on GitHub. The core business logic and modeling functions reside in the src/selection directory, within this specific file:
src/selection/logit_model_selection.py
The associated analysis and findings are recorded in:
08_logistic_model_selection.qmd
It has become far simpler to produce code, streamline model training, benchmark metrics, and compile summary reports. With a handful of well-crafted prompts, a data scientist can now draft Python scripts, fit logistic regression models, calculate AUC and Gini scores, create visualizations, and log the outcomes.
However, this convenience introduces a significant pitfall.
A scoring model is more than just code that executes without errors. It is not merely the model that achieves the highest accuracy on the training set. Within a professional credit risk setting, a scoring model must be statistically rigorous, resilient over time, transparent in its logic, aligned with business requirements, and straightforward to oversee once deployed.
This piece is one installment in a larger series focused on constructing scoring models that are robust, transparent, and stable. Earlier installments addressed the critical preliminary steps: assembling datasets, conducting exploratory data analysis, refining variables, pre-screening predictors, evaluating temporal consistency, benchmarking development against validation samples, and binning continuous variables.
We now shift our focus to a pivotal phase: training multiple candidate models and choosing the definitive one.
The aim here is to outline a structured approach for training various scoring models, benchmarking their effectiveness, verifying their stability, and ultimately picking a final model grounded in statistical, business, and operational considerations.
AI assistants like ChatGPT, Codex, and GitHub Copilot can streamline code generation, automate modeling workflows, execute statistical checks, compile summary tables, and document findings. For this particular study, we will leverage Codex and evaluate its proficiency in handling each of these responsibilities.
The article is structured into three segments. First, we detail the datasets employed in the modeling exercise. Second, we outline the methodology for training and appraising candidate models. Third, we describe the process for interpreting results and making the final model selection.
The Datasets
To demonstrate this essential phase, we utilize a publicly available dataset from Kaggle: the Credit Scoring Dataset. This collection comprises 32,581 records and 12 attributes detailing loans extended by a bank to individual clients.
Across this series, we have implemented various processing techniques on these attributes to identify the most suitable candidate variables for the final model, adhering to both statistical and regulatory standards.
In this specific case, the variables that survived the preselection phase are all categorical. The majority feature two or three distinct categories. This aligns with our earlier methodology, where continuous variables were converted into discrete bins to enhance transparency and simplify the interpretation of the final score.
The selected variables are:
These serve as the explanatory variables, represented as $X_1, \dots, X_q$. Here, $q$ equals 6.
The target variable, labeled $Y$, indicates the occurrence of default. Specifically, it maps to the loan_status field. It is defined as follows:
$$Y = \begin{cases} 1 & \text{if the borrower defaults} \\ 0 & \text{otherwise} \end{cases}$$
The objective is to calculate the likelihood of default based on the observed borrower characteristics:
$$P(Y = 1 \mid X_1, \dots, X_q)$$
The resulting score is then derived by transforming this calculated probability. For logistic regression, this transformation utilizes the logit function.
The dataset is partitioned into three distinct subsets.
The training subset is utilized to calculate the parameters for the candidate models. In our scenario, this is further segmented into four folds to verify the consistency of the models across various data partitions.
The test subset is employed to measure model performance on data that was not used to train the coefficients. This confirms whether the model can effectively generalize to a population similar to the development set.
The out-of-time subset is used to gauge stability across different periods. This is a critical factor in credit scoring. A model must not only excel during its development phase but must also maintain its reliability when applied to a subsequent timeframe.
This separation is vital because a model might appear highly effective on training data yet perform poorly on the out-of-time subset. Such a decline suggests the model may be overfitted or overly reliant on the specific conditions of the development period.
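The three-way partition with four training folds can be sketched as follows. This is a minimal illustration, not the split used in the series: the `period` field, the cutoff value, the 30% test share, and the round-robin fold assignment are all assumptions introduced for the example.

```python
import random


def split_samples(records, cutoff_period, test_share=0.3, n_folds=4, seed=42):
    """Split records into train, test, and out-of-time subsets, plus train folds.

    Assumes each record carries a hypothetical 'period' key (e.g. origination year).
    """
    rng = random.Random(seed)
    # Everything observed after the cutoff is held out as the out-of-time sample.
    oot = [r for r in records if r["period"] > cutoff_period]
    development = [r for r in records if r["period"] <= cutoff_period]
    rng.shuffle(development)
    n_test = int(round(test_share * len(development)))
    test, train = development[:n_test], development[n_test:]
    # Assign training records to folds in round-robin order.
    folds = {k: [] for k in range(n_folds)}
    for i, record in enumerate(train):
        folds[i % n_folds].append(record)
    return train, test, oot, folds


# Toy records with a synthetic period between 2015 and 2020.
records = [{"id": i, "period": 2015 + i % 6} for i in range(120)]
train, test, oot, folds = split_samples(records, cutoff_period=2018)
print(len(train), len(test), len(oot), [len(f) for f in folds.values()])
```

Separating the out-of-time sample by date before any shuffling keeps later periods out of the data used to estimate coefficients.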
Reformulating the Scoring Problem
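A scoring model quantifies the relationship between a binary target variable and a set of explanatory variables.
Consider a target variable $Y \in \{0, 1\}$ along with a collection of explanatory variables $X_1, \dots, X_q$.
For every individual $i$, the model generates a score derived from the estimated likelihood of default:
$$\hat{p}_i = \hat{P}(Y_i = 1 \mid X_{i1}, \dots, X_{iq})$$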
In credit scoring, the score must effectively rank borrowers according to their risk level. A well-performing model should, on average, assign higher risk scores to borrowers who default and lower risk scores to those who do not.
This ranking capability is precisely why discrimination metrics like AUC and Gini are so important in scoring. However, strong discrimination alone is insufficient. A model may exhibit excellent predictive accuracy while remaining unstable, hard to interpret, or misaligned with business logic.
For this reason, the final model should be chosen based on multiple criteria rather than relying on a single performance measure.
Why Logistic Regression Remains the Benchmark
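Given that the target variable is binary, logistic regression serves as a natural benchmark. It expresses the log-odds of default as a linear function of the explanatory variables:
$$\ln\left(\frac{P(Y = 1 \mid X_1, \dots, X_q)}{1 - P(Y = 1 \mid X_1, \dots, X_q)}\right) = \beta_0 + \sum_{j=1}^{q} \beta_j X_j$$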
Logistic regression offers several strengths in a scoring environment. It is purpose-built for binary outcomes, yields coefficients that are easy to interpret, enables analysts to confirm the direction of risk, and is widely understood by statistical, business, and IT teams alike. It is also relatively straightforward to deploy in production.
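To make the link between log-odds and probability concrete, here is a minimal sketch. The intercept, the coefficient values, and the indicator names `var_1_B` and `var_2_C` are hypothetical and chosen only for illustration.

```python
import math


def log_odds(intercept, coefficients, active_indicators):
    """Linear predictor: intercept plus the coefficients of the active binary indicators."""
    return intercept + sum(coefficients[name] for name in active_indicators)


def default_probability(z):
    """Inverse logit: maps log-odds back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Hypothetical fitted values, not taken from the article's model.
intercept = -2.0
coefficients = {"var_1_B": 0.8, "var_2_C": 1.2}

z = log_odds(intercept, coefficients, ["var_1_B"])
p = default_probability(z)
print(f"log-odds = {z:.2f}, probability of default = {p:.4f}")
```

Because the relationship is linear on the log-odds scale, each coefficient shifts the log-odds by a fixed amount, which is what makes the model easy to read and audit.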
In today’s era of artificial intelligence, there may be a temptation to jump straight to more sophisticated models such as random forests, gradient boosting, or neural networks. These approaches can sometimes achieve superior raw performance.
Yet in credit scoring, raw performance is not the sole goal. The model must also be explainable, well-documented, stable over time, and consistent with business expectations. For these reasons, logistic regression continues to be a strong reference point and, in many cases, the preferred model for production use.
Artificial intelligence can speed up the modeling workflow, but it does not alter the fundamental requirements of a professional scoring model.
Preparing Categorical Variables
Because the explanatory variables are categorical, they need to be transformed before being fed into logistic regression.
Each categorical variable is encoded into dummy variables. If a variable has n categories, it is represented by n – 1 binary indicators, with one category designated as the reference.
This prevents perfect multicollinearity among the categories. The resulting coefficients are then interpreted relative to the reference category.
For instance, imagine a variable with three categories: A, B, and C. If A is chosen as the reference, the model estimates one coefficient for B and one for C. These coefficients capture the difference in risk between B and A, and between C and A.
Under this methodology, the reference category is selected as the least risky option—that is, the category with the lowest default rate in the training sample. This choice simplifies interpretation: positive coefficients signal higher risk compared to the safest category.
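The encoding rule can be sketched with pandas as follows. The toy data and the variable name `home_status` are hypothetical; only the `loan_status` target name comes from the dataset.

```python
import pandas as pd


def encode_with_safest_reference(train, variable, target):
    """Dummy-encode one categorical variable, dropping the least risky category.

    The reference is the category with the lowest default rate in the training sample.
    """
    default_rates = train.groupby(variable)[target].mean()
    reference = default_rates.idxmin()
    dummies = pd.get_dummies(train[variable], prefix=variable, dtype=int)
    return dummies.drop(columns=f"{variable}_{reference}")


# Toy training sample: default rates are A = 25%, B = 67%, C = 33%.
train = pd.DataFrame({
    "home_status": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "loan_status": [0, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})
encoded = encode_with_safest_reference(train, "home_status", "loan_status")
print(list(encoded.columns))  # category A is the reference and is dropped
```

The reference should be determined on the training sample only and then reused unchanged when encoding the test and out-of-time samples.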
Training Candidate Models
Following variable preselection, all meaningful combinations of candidate variables are tested.
The aim is not simply to find the model with the highest training performance. Instead, the goal is to identify a model that meets several key requirements:
- Statistical validity;
- Business consistency;
- Adequate discriminatory power;
- Stability across different samples;
- A manageable number of variables;
- Low multicollinearity;
- Clear interpretability.
For each variable combination, a logistic regression is fitted on the training sample and evaluated across the validation folds.
Every candidate model is assessed using four categories of criteria: statistical validation, predictive performance, stability, and interpretability.
This process can be largely automated with the help of artificial intelligence. An AI coding assistant can assist in generating loops over variable combinations, fitting models, storing coefficients, computing metrics, and producing comparison tables.
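The enumeration step is simple to sketch with the standard library. The example assumes that every non-empty subset of the candidates is tested, and the names `var_1` to `var_6` are placeholders rather than the actual variables.

```python
from itertools import combinations

# Placeholder names standing in for the six preselected variables.
candidate_variables = ["var_1", "var_2", "var_3", "var_4", "var_5", "var_6"]


def all_variable_combinations(variables):
    """Yield every non-empty subset of the candidate variables, smallest first."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


combos = list(all_variable_combinations(candidate_variables))
print(len(combos))  # 2**6 - 1 = 63 candidate models
```

With six candidates the search space is small enough to fit every model exhaustively; the count doubles with each added variable, so this approach does not scale to large candidate sets.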
Statistical Validation Criteria
The first level of assessment focuses on statistical validity.
Global Significance
To evaluate a model’s overall importance, a likelihood ratio test is used. This method pits the complete model against a basic null model that contains only an intercept.
The goal is to confirm whether the set of explanatory variables provides meaningful information for predicting the target variable.
If a model fails to outperform the null model in a significant way, it should be discarded, even if certain descriptive statistics seem satisfactory.
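A likelihood ratio test against the intercept-only model can be sketched as below, assuming SciPy is available. The toy data use a single binary predictor, for which the maximum-likelihood fitted probabilities equal the default rate of each group; in practice the fitted probabilities would come from the estimated model.

```python
import math

from scipy.stats import chi2


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))


def likelihood_ratio_test(y, p_full, n_params):
    """Compare a fitted model with the intercept-only model.

    n_params is the number of estimated coefficients excluding the intercept.
    """
    base_rate = sum(y) / len(y)
    ll_null = bernoulli_log_likelihood(y, [base_rate] * len(y))
    ll_full = bernoulli_log_likelihood(y, p_full)
    statistic = 2.0 * (ll_full - ll_null)
    p_value = chi2.sf(statistic, df=n_params)
    return statistic, p_value


# Group A: 1 default out of 8. Group B: 5 defaults out of 8.
y = [1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 0, 0, 0]
p_full = [0.125] * 8 + [0.625] * 8

statistic, p_value = likelihood_ratio_test(y, p_full, n_params=1)
print(f"LR statistic = {statistic:.3f}, p-value = {p_value:.4f}")
```

A p-value below the chosen threshold indicates that the explanatory variables add information beyond the overall default rate.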
Individual Significance
The significance of individual variables is determined by examining their coefficients and the p-values of the related statistical tests, such as Wald tests or likelihood ratio tests.
According to this methodology, chosen variables must meet a 5% significance threshold. It’s also important to review the individual categories (modalities) to confirm that each selected variable makes a meaningful contribution to distinguishing risk.
This step is crucial because a variable might seem useful as a whole, yet some of its specific categories could be weak, unstable, or hard to interpret.
Direction of Risk
Statistical significance alone is insufficient. The coefficients must also align with business logic.
If a particular category is expected to indicate higher risk, its coefficient should reflect an increased probability of default compared to the reference group.
A model can be statistically robust but hard to defend if the risk direction contradicts economic or business reasoning. In professional scoring, such inconsistencies must be thoroughly examined before the model is approved.
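When the reference category is the least risky one, a simple automated check is to flag any negative coefficient for manual review. The sketch below assumes that convention; the coefficient names and values are hypothetical.

```python
def contradicting_coefficients(coefficients, intercept_name="Intercept"):
    """Return the indicators whose coefficient is negative.

    Assumes the reference category of every variable is its least risky category,
    so each remaining category is expected to carry a positive coefficient.
    """
    return sorted(
        name
        for name, beta in coefficients.items()
        if name != intercept_name and beta < 0.0
    )


# Hypothetical fitted coefficients for one candidate model.
fitted = {"Intercept": -2.10, "var_1_B": 0.45, "var_1_C": 0.90, "var_2_B": -0.12}
print(contradicting_coefficients(fitted))  # ['var_2_B']
```

A flagged coefficient is a prompt for investigation rather than an automatic rejection: a sign can flip in a multivariate model because of correlation with another variable, which is itself useful to know.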
Multicollinearity
Multicollinearity can lead to unstable and hard-to-interpret coefficient estimates. It is typically evaluated using the Variance Inflation Factor (VIF).
In this methodology, accepted models must meet the following condition:
VIF < 10
Since the variables are categorical, the VIF is computed using the dummy variables, leaving out the reference categories. For each categorical variable, a simple status is assigned:
- OK if all categories meet the VIF requirement;
- KO if any category has a VIF >= 10.
This guideline helps filter out models where explanatory variables are excessively redundant.
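The VIF rule and the OK/KO status can be sketched with NumPy. Each indicator is regressed on all the others, and the VIF is 1 / (1 - R²). The synthetic indicators and the names `var_a` and `var_b` are assumptions made for the example.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X, regressing it on the other columns plus an intercept."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        residuals = y - others @ beta
        r_squared = 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()
        vifs[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return vifs


def vif_status(vifs, dummy_owner, threshold=10.0):
    """Mark a variable KO as soon as one of its indicators reaches the threshold."""
    status = {}
    for vif, variable in zip(vifs, dummy_owner):
        if vif >= threshold:
            status[variable] = "KO"
        else:
            status.setdefault(variable, "OK")
    return status


# Synthetic indicators: d2 is almost a copy of d1, d3 and d4 are nearly independent.
idx = np.arange(100)
d1 = (idx < 50).astype(float)
d2 = d1.copy()
d2[0], d2[50] = 0.0, 1.0
d3 = (idx % 2).astype(float)
d4 = ((idx // 2) % 2).astype(float)
X = np.column_stack([d1, d2, d3, d4])

vifs = variance_inflation_factors(X)
status = vif_status(vifs, ["var_a", "var_a", "var_b", "var_b"])
print(np.round(vifs, 2), status)
```

The reference categories are left out of `X` before the computation, as described above; including them would make the indicators of a variable perfectly collinear with the intercept.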
Goodness of Fit
Goodness of fit can be evaluated using tests like the Hosmer-Lemeshow test. This test compares predicted probabilities against observed default rates across different risk groups.
It should not be used in isolation, but it can offer valuable insights into model calibration.
In this application, the Hosmer-Lemeshow test is not directly applied. Our Python workflow does not depend on a standard built-in function for this test. Therefore, it should either be coded manually, implemented with a verified external function, or managed in another statistical environment. A separate article will address this topic in detail.
Performance Metrics
Model performance is assessed from two angles.
The first angle evaluates discrimination: the model’s capacity to differentiate between borrowers who default and those who do not. This is measured by the ROC curve, AUC, and Gini.
The second angle deals with class imbalance and the accuracy of predicting the positive class. This is measured by recall, precision, F1-score, and PR-AUC.
ROC Curve, AUC, and Gini
The ROC curve illustrates the trade-off between the true positive rate and the false positive rate at various classification thresholds.
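The true positive rate, also known as recall, is calculated as:
$$\text{TPR} = \frac{TP}{TP + FN}$$
It indicates the proportion of actual defaults that the model correctly identifies.
The false positive rate is calculated as:
$$\text{FPR} = \frac{FP}{FP + TN}$$
It indicates the proportion of non-defaulting borrowers that the model incorrectly flags as defaults. Here TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives at a given threshold.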
The AUC, or Area Under the Curve, summarizes the ROC curve. An AUC closer to 1 signifies a better model at ranking risky versus non-risky borrowers. An AUC near 0.5 suggests performance similar to random guessing.
The Gini index is a common transformation of AUC used in credit scoring:
$$\text{Gini} = 2 \times \text{AUC} - 1$$
A Gini of 0 represents random performance. A higher Gini value indicates stronger discriminatory ability.
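As a small illustration, the AUC can be computed directly from its ranking interpretation: the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The Gini then follows from the standard relation Gini = 2 × AUC − 1. The toy scores below are invented for the example.

```python
def auc_score(y_true, scores):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Ties count for one half. Quadratic in sample size, so suitable for small examples only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_index(y_true, scores):
    """Gini derived from AUC."""
    return 2.0 * auc_score(y_true, scores) - 1.0


# Toy sample: 3 defaults among 8 borrowers, scores are predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

print(f"AUC = {auc_score(y_true, scores):.4f}, Gini = {gini_index(y_true, scores):.4f}")
```

On real samples, a library routine such as `roc_auc_score` from scikit-learn is the usual choice; the pairwise version here is only meant to show what the number measures.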
Recall, Precision, and F1-Score
When the target variable is imbalanced, it’s beneficial to supplement AUC and Gini with metrics that focus on the default class.
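Recall measures the proportion of actual defaults that are correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision measures the proportion of predicted defaults that are actually defaults:
$$\text{Precision} = \frac{TP}{TP + FP}$$
The F1-score combines precision and recall using a harmonic mean, with TP, FP, and FN counted at the chosen classification threshold:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$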
This measure becomes valuable when we must strike a balance between catching actual defaults and keeping false alarms to a minimum.
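These three metrics depend on a classification threshold, which the sketch below makes explicit. The 0.5 threshold and the toy scores are assumptions for illustration, not values used in the article.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Recall, precision, and F1 for the default class at a given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + recall
    f1 = 2.0 * precision * recall / denominator if denominator else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

metrics = classification_metrics(y_true, scores, threshold=0.5)
print(metrics)
```

Lowering the threshold raises recall at the cost of precision, so the threshold should reflect the relative cost of missed defaults and false alarms.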
Precision-Recall AUC
The Precision-Recall curve traces how precision changes as recall varies across different decision thresholds. It shines especially when defaults are uncommon in the dataset.
The PR-AUC value should be evaluated against the baseline default rate in the sample. A competent model should typically achieve a PR-AUC that exceeds the observed default rate.
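One common estimator of PR-AUC is average precision, sketched below alongside the baseline comparison. The implementation assumes there are no tied scores, and the toy data are invented for the example.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at the rank of each default.

    Assumes no tied scores; ties would need to be grouped before computing precision.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

pr_auc = average_precision(y_true, scores)
baseline = sum(y_true) / len(y_true)  # observed default rate
print(f"PR-AUC = {pr_auc:.4f}, baseline = {baseline:.4f}")
```

A model that ranks borrowers at random would obtain a PR-AUC close to the baseline, which is why the comparison with the default rate is more informative than the raw value.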
Conditional Score Distributions
Numerical measures alone aren’t enough — visual diagnostics should accompany them.
Looking at how scores are distributed separately for borrowers who defaulted versus those who didn’t reveals how well the model distinguishes between the two groups.
A strong model should produce clearly separated score distributions. When the two distributions overlap heavily, the model’s ability to discriminate is weak, even if certain summary metrics look reasonable.
Stability Criteria
Choosing a scoring model shouldn’t rest solely on how well it performs on training data. It needs to hold up consistently across different data samples.
For this reason, performance is benchmarked across:
- the training sample;
- the test sample;
- the out-of-time sample;
- the validation folds.
A model that shows a high Gini on training data but drops sharply on the test or out-of-time sample is likely overfitted.
To incorporate stability into the evaluation, a penalized Gini metric is applied:
This metric favors models that deliver solid average performance across folds while showing minimal performance drop between samples.
The same approach can be extended to recall, precision, F1-score, and PR-AUC.
The underlying principle is straightforward: a good scoring model should not only perform well — it should perform reliably.
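The idea of a stability penalty can be sketched in code. The exact penalty used in the analysis is not reproduced in this text, so the form below is purely an assumption: the average metric across folds, minus the largest drop observed between the training sample and the test or out-of-time sample. The function name, the `penalty_weight` parameter, and the sample values are all hypothetical.

```python
def penalized_metric(fold_values, train_value, test_value, oot_value, penalty_weight=1.0):
    """Hypothetical stability-penalized metric (not the article's formula).

    Starts from the mean across folds and subtracts the largest drop
    from train to test or from train to out-of-time.
    """
    mean_folds = sum(fold_values) / len(fold_values)
    largest_drop = max(0.0, train_value - test_value, train_value - oot_value)
    return mean_folds - penalty_weight * largest_drop


# Invented Gini values for one candidate model.
result = penalized_metric(
    fold_values=[0.58, 0.60, 0.59, 0.61],
    train_value=0.60,
    test_value=0.58,
    oot_value=0.57,
)
print(f"penalized Gini = {result:.4f}")
```

Whatever the exact form, the design choice is the same: two models with equal average performance are separated by how much they degrade outside the training sample.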
Selecting the Optimal Number of Variables
Once models that pass statistical checks have been shortlisted, performance is examined as a function of the number of variables included.
The aim is to identify the most compact model that still delivers adequate performance and stability.
More complexity doesn’t automatically mean better results. Adding extra variables might nudge the Gini upward slightly, but it can also erode stability, raise the risk of overfitting, and complicate interpretation.
The final model should balance:
- performance;
- stability;
- interpretability;
- simplicity;
- business consistency.
In scoring, achieving this equilibrium often matters more than chasing the highest possible single metric.
A model built on six stable, interpretable variables may be a better choice than one using ten variables with a marginally higher training Gini.
The Role of Large Language Models
In this article, the code for training, comparing, and selecting models was generated with the help of an artificial intelligence tool — specifically Codex paired with an advanced reasoning model.
The goal isn’t to hand over statistical decision-making to AI. Rather, it’s to leverage AI as a productivity booster for repetitive and technical work.
AI can assist with writing data preparation scripts, automating variable combinations, fitting logistic regressions, calculating performance metrics, verifying statistical constraints, comparing train, test, and out-of-time results, generating summary tables, and documenting the overall workflow.
This positions AI as a capable methodological assistant.
That said, the outputs still require careful review. Statistical tests need proper interpretation. Coefficients must be scrutinized. Business sense must be confirmed. Stability must be evaluated. The final model choice rests with the analyst, not with the tool.
Presenting the Results
The way results are presented should mirror the logic used during model selection.
Start by showing how many candidate variables were considered, how many combinations were tested, and how many models were filtered out at each step. This makes the selection process fully transparent.
Next, present the models that passed statistical validation. These are the models that meet the core criteria: overall significance, individual variable significance, logically consistent risk direction, acceptable VIF values, and stable coefficients.
Then, compare the remaining models using performance and stability indicators:
- average Gini across folds;
- train Gini;
- test Gini;
- out-of-time Gini;
- train-test gap;
- train-out-of-time gap;
- penalized Gini;
- recall;
- precision;
- F1-score;
- PR-AUC.
The top-performing model for each variable count — one that satisfies all statistical and stability requirements — is displayed in the table below.

The final model selection depends on the specific objective. In this case, Model 4 is chosen. The default rate on the training set is 22%, which establishes a minimum PR-AUC benchmark of roughly 22%. A meaningful model must attain a PR-AUC significantly higher than this benchmark.
Model 5 delivers the strongest penalized PR-AUC, the strongest penalized recall, and the strongest penalized F1-score. When the main goal is detecting defaults in an operational setting through a classification threshold, Model 5 stands out as a strong candidate.
That said, for a scoring model, the key metric continues to be the ability to rank risk—captured by the Gini index—especially on the test and out-of-time datasets, and in our analysis, the penalized version of the Gini.
Model 4 provides the strongest overall balance for the following reasons:
- It reaches the highest penalized Gini at 56.01%, indicating robust and consistent discriminatory power across all datasets.
- It slightly outperforms Model 3 by including the variable cb_person_default_on_file, which contributes valuable risk-related information.
- Its penalized PR-AUC of 48.44% comfortably exceeds the 22% default rate, validating the model’s capacity to single out borrowers who are likely to default.
- With just 4 variables, it stays highly transparent and straightforward to communicate to business and governance stakeholders.
Based on these considerations, Model 4 is chosen as the final scoring model. The estimated coefficients for this model are shown in the table below:

Lastly, the chart below provides an overview of the final model’s discrimination performance by displaying the Gini index across the training, test, and out-of-time datasets. The findings confirm that no overfitting is present, as the Gini values stay consistent across all three datasets.

The model has been serialized in Python using the pickle format for later use, for example, to generate scores for the various counterparties within the portfolio scope.
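Serialization with pickle can be sketched as follows. The artifact below is a plain dictionary with hypothetical variable names and coefficient values (only cb_person_default_on_file comes from the article); the file name is also an assumption, and a fitted model object would be saved the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical artifact standing in for the fitted model.
model_artifact = {
    "model_name": "model_4",
    "variables": ["var_1", "var_2", "var_3", "cb_person_default_on_file"],
    "coefficients": {"Intercept": -2.0, "var_1_B": 0.8},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "final_scoring_model.pkl"
    # Write the artifact to disk.
    with path.open("wb") as f:
        pickle.dump(model_artifact, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Reload it, as a later scoring job would.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored["variables"])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. It is also prudent to record the library versions used at training time, because a pickled model may not load correctly under different versions.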
Conclusion
In this article, we walked through the essential steps for picking the best candidate model—a model that will then be used to construct a score capable of distinguishing between counterparties across a retail portfolio, with logistic regression serving as the foundational framework.
The findings reveal that the four-variable model strikes the best balance between discriminatory performance, predictive accuracy, and stability over time. With a Gini of roughly 60% and a PR-AUC of around 49%, it demonstrates both a strong capacity to rank risk and a meaningful ability to flag defaulting borrowers—well above the 22% baseline established by the observed default rate.
Throughout this work, we leveraged OpenAI’s Codex agent to help with code generation and chart creation. The outputs were produced by defining the required format, with no further manual refinements. The quality of the results was consistently excellent, confirming that this kind of tool can act as a dependable methodological aid and is poised to significantly shape how scoring models are built going forward.
In the next installment, we will cover how scores are calculated for the various counterparties in the portfolio, along with the individual contribution each variable makes to the final score.
Data & Licensing
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license permits anyone to share and adapt the dataset for any purpose, including commercial use, as long as proper credit is given to the original source.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.



