Thank you for your feedback and interest in my earlier article. Since a number of readers asked how to replicate the analysis, I decided to share the full code on GitHub for both this article and the previous one. This will let you easily reproduce the results, better understand the methodology, and explore the project in more detail.
In this post, we show that analyzing the relationships between variables in credit scoring serves two main purposes:
- Evaluating the ability of explanatory variables to discriminate default (see Section 1.1)
- Reducing dimensionality by studying the relationships between explanatory variables (see Section 1.2)
- In Section 1.3, we apply these methods to the dataset introduced in our previous post.
- In conclusion, we summarize the key takeaways and highlight the points that may be useful for interviews, whether for an internship or a full-time position.
As we develop and improve our modeling skills, we often look back and smile at our early attempts, the first models we built, and the mistakes we made along the way.
I remember building a scoring model using Kaggle resources without really understanding how to analyze relationships between variables. Whether it involved two continuous variables, a continuous and a categorical variable, or two categorical variables, I lacked both the graphical intuition and the statistical tools needed to study them properly.
It wasn't until my third year, during a credit scoring project, that I fully grasped their importance. That experience is why I strongly encourage anyone building their first scoring model to take the analysis of relationships between variables seriously.
Why Studying Relationships Between Variables Matters
The first objective is to identify the variables that best explain the phenomenon under study, for example, predicting default.
However, correlation is not causation. Any insight must be supported by:
- academic research
- domain expertise
- data visualization
- and expert judgment
The second objective is dimensionality reduction. By defining appropriate thresholds, we can preselect variables that show meaningful associations with the target or with other predictors. This helps reduce redundancy and improve model performance.
It also provides early guidance on which variables are likely to be retained in the final model and helps detect potential modeling issues. For instance, if a variable with no meaningful relationship to the target ends up in the final model, this may indicate a weakness in the modeling process. In such cases, it is important to revisit earlier steps and identify possible shortcomings.
In this article, we focus on three types of relationships:
- Two continuous variables
- One continuous and one qualitative variable
- Two qualitative variables
All analyses are performed on the training dataset. In a previous article, we addressed outliers and missing values, an essential prerequisite before any statistical analysis. Therefore, we will work with a cleaned dataset to analyze relationships between variables.
Outliers and missing values can significantly distort both statistical measures and visual interpretations of relationships. That is why it is important to ensure that preprocessing steps, such as handling missing values and outliers, are carried out carefully and appropriately.
The goal of this article is not to provide an exhaustive list of statistical tests for measuring associations between variables. Instead, it aims to give you the essential foundations needed to understand the importance of this step in building a reliable scoring model.
The methods presented here are among the most commonly used in practice. However, depending on the context, analysts may rely on additional or more advanced techniques.
By the end of this article, you should be able to confidently answer the following three questions, which are often asked in internship or job interviews:
- How do you measure the relationship between two continuous variables?
- How do you measure the relationship between two qualitative variables?
- How do you measure the relationship between a qualitative variable and a continuous variable?
Graphical Analysis
I initially wanted to skip this step and go straight to statistical testing. However, since this article is intended for beginners in modeling, this is arguably the most important part.
Every time you have the opportunity to visualize your data, you should take it. Visualization can reveal a great deal about the underlying structure of the data, often more than a single statistical metric.
This step is particularly important during the exploratory phase, as well as during decision-making and discussions with domain experts. The insights derived from visualizations should always be validated by:
- subject matter experts
- the context of the study
- and relevant academic or scientific literature
By combining these perspectives, we can eliminate variables that are not relevant to the problem or that may lead to misleading conclusions. At the same time, we can identify the most informative variables that truly help explain the phenomenon under study.
When this step is carefully executed and supported by academic research and expert validation, we can have greater confidence in the statistical tests that follow, which ultimately summarize the information into indicators such as p-values or correlation coefficients.
In credit scoring, the objective is to select, from a set of candidate variables, those that best explain the target, typically default.
That is why we study relationships between variables.
We will see later that some models are sensitive to multicollinearity, which occurs when several variables carry similar information. Reducing redundancy is therefore essential.
In our case, the target variable is binary (default vs. non-default), and we aim to discriminate it using explanatory variables that may be either continuous or categorical.
Graphically, we can assess the discriminative power of these variables, that is, their ability to predict the default outcome. In the following section, we present graphical methods and test statistics that can be automated to analyze the relationship between continuous or categorical explanatory variables and the target variable, using programming languages such as Python.
1.1 Evaluation of Predictive Power
In this section, we present the graphical and statistical tools used to assess the ability of both continuous and categorical explanatory variables to capture the relationship with the target variable, namely default (def).
1.1.1 Continuous Variable vs. Binary Target
If the variable we are evaluating is continuous, the goal is to compare its distribution across the two target classes:
- non-default (def = 0)
- default (def = 1)
We can use:
- boxplots to compare medians and dispersion
- density plots (KDE) to compare distributions
- empirical cumulative distribution functions (ECDF)
The key idea is simple:
Does the distribution of the variable differ between defaulters and non-defaulters?
If the answer is yes, the variable may have discriminative power.
Assume we want to assess how well person_income discriminates between defaulting and non-defaulting borrowers. Graphically, we can compare summary statistics such as the mean or median, as well as the distributions through density plots or cumulative distribution functions (CDFs) for defaulted and non-defaulted counterparties. The resulting visualization is shown below.
def plot_continuous_vs_categorical(
    df,
    continuous_var,
    categorical_var,
    category_labels=None,
    figsize=(12, 10),
    sample=None
):
    """
    Compare a continuous variable across categories
    using boxplot, KDE, and ECDF (2x2 layout).
    """
    sns.set_style("white")
    data = df[[continuous_var, categorical_var]].dropna().copy()
    # Optional sampling
    if sample:
        data = data.sample(sample, random_state=42)
    categories = sorted(data[categorical_var].unique())
    # Labels mapping (optional)
    if category_labels:
        labels = [category_labels.get(cat, str(cat)) for cat in categories]
    else:
        labels = [str(cat) for cat in categories]
    fig, axes = plt.subplots(2, 2, figsize=figsize)
    # --- 1. Boxplot ---
    sns.boxplot(
        data=data,
        x=categorical_var,
        y=continuous_var,
        ax=axes[0, 0]
    )
    axes[0, 0].set_title("Boxplot (median & spread)", loc="left")
    # --- 2. Boxplot with median comparison ---
    sns.boxplot(
        data=data,
        x=categorical_var,
        y=continuous_var,
        ax=axes[0, 1],
        showmeans=True,
        meanprops={
            "marker": "o",
            "markerfacecolor": "white",
            "markeredgecolor": "black",
            "markersize": 6
        }
    )
    axes[0, 1].set_title("Median comparison (Boxplot)", loc="left")
    medians = data.groupby(categorical_var)[continuous_var].median()
    for i, cat in enumerate(categories):
        axes[0, 1].text(
            i,
            medians[cat],
            f"{medians[cat]:.2f}",
            ha='center',
            va='bottom',
            fontsize=10,
            fontweight='bold'
        )
    # --- 3. KDE only ---
    for cat, label in zip(categories, labels):
        subset = data[data[categorical_var] == cat][continuous_var]
        sns.kdeplot(
            subset,
            ax=axes[1, 0],
            label=label
        )
    axes[1, 0].set_title("Density comparison (KDE)", loc="left")
    axes[1, 0].legend()
    # --- 4. ECDF ---
    for cat, label in zip(categories, labels):
        subset = np.sort(data[data[categorical_var] == cat][continuous_var])
        y = np.arange(1, len(subset) + 1) / len(subset)
        axes[1, 1].plot(subset, y, label=label)
    axes[1, 1].set_title("Cumulative distribution (ECDF)", loc="left")
    axes[1, 1].legend()
    # Clean style (Storytelling with Data)
    for ax in axes.flat:
        sns.despine(ax=ax)
        ax.grid(axis="y", alpha=0.2)
    plt.tight_layout()
    plt.show()

plot_continuous_vs_categorical(
    df=train_imputed,
    continuous_var="person_income",
    categorical_var="def",
    category_labels={0: "No Default", 1: "Default"},
    figsize=(14, 12),
    sample=5000
)

Defaulted borrowers tend to have lower incomes than non-defaulted borrowers. The distributions show a clear shift, with defaults concentrated at lower income levels. Overall, income has good discriminatory power for predicting default.
1.1.2 Statistical Test: Kruskal–Wallis for a Continuous Variable vs. a Binary Target
To formally assess this relationship, we use the Kruskal–Wallis test, a non-parametric method.
It evaluates whether several independent samples come from the same distribution.
More precisely, it tests whether k samples (k ≥ 2) originate from the same population, or from populations with identical characteristics in terms of a location parameter. This parameter is conceptually close to the median, but the Kruskal–Wallis test incorporates more information than the median alone.
The principle of the test is as follows. Let θ_i denote the location parameter of sample i. The hypotheses are:
- H0: θ_1 = θ_2 = … = θ_k
- H1: there exists at least one pair (i, j) such that θ_i ≠ θ_j
When k = 2, the Kruskal–Wallis test reduces to the Mann–Whitney U test.
The test statistic approximately follows a Chi-square distribution with k − 1 degrees of freedom (for sufficiently large samples).
- If the p-value < 5%, we reject H0
- This indicates that at least one group differs significantly
Therefore, for a given quantitative explanatory variable, if the p-value is less than 5%, the null hypothesis is rejected, and we may conclude that the explanatory variable under consideration could be predictive in the model.
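As a quick illustration of how this decision rule works in Python, here is a minimal sketch using scipy.stats.kruskal on synthetic income data (the two samples below are simulated for illustration, not taken from the article's dataset):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)
# Synthetic incomes: defaulters drawn from a distribution with a lower location
income_no_default = rng.lognormal(mean=11.0, sigma=0.5, size=1000)
income_default = rng.lognormal(mean=10.7, sigma=0.5, size=300)

stat, p_value = kruskal(income_no_default, income_default)
print(f"H = {stat:.2f}, p-value = {p_value:.4g}")
# A p-value below 0.05 leads us to reject H0 (same location parameter),
# suggesting the variable discriminates between the two groups
```

With only two groups, this is equivalent in spirit to the Mann–Whitney U comparison mentioned above; with k > 2 groups, the same call simply takes more samples as arguments.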
1.1.3 Qualitative Variable vs. Binary Target
If the explanatory variable is qualitative, the appropriate tool is the contingency table, which summarizes the joint distribution of the two variables.
It shows how the categories of the explanatory variable are distributed across the two classes of the target. To illustrate the relationship between person_home_ownership and the default variable, the contingency table is given by:
def contingency_analysis(
    df,
    var1,
    var2,
    normalize=None,  # None, "index", "columns", "all"
    plot=True,
    figsize=(8, 6)
):
    """
    Compute and visualize a contingency table.
    """
    # --- Contingency table ---
    table = pd.crosstab(df[var1], df[var2], margins=False)
    # --- Normalized version (optional) ---
    if normalize:
        table_norm = pd.crosstab(df[var1], df[var2], normalize=normalize, margins=False).round(3) * 100
    else:
        table_norm = None
    # --- Plot (heatmap) ---
    if plot:
        sns.set_style("white")
        plt.figure(figsize=figsize)
        data_to_plot = table_norm if table_norm is not None else table
        sns.heatmap(
            data_to_plot,
            annot=True,
            fmt=".2f" if normalize else "d",
            cbar=True
        )
        plt.title(f"{var1} vs {var2} (Contingency Table)", loc="left", weight="bold")
        plt.xlabel(var2)
        plt.ylabel(var1)
        sns.despine()
        plt.tight_layout()
        plt.show()
    return table

From this table, we can:
- Compare conditional distributions across categories.
- Observe that borrowers who rent or fall into the "other" category default more often, while homeowners have the lowest default rate. Mortgage holders are in between, suggesting moderate risk.
For visualization, grouped bar charts are often used. They provide an intuitive way to compare conditional proportions across categories.
import matplotlib.patches as mpatches

def plot_grouped_bar(df, cat_var, subcat_var,
                     normalize="index", title=""):
    ct = pd.crosstab(df[subcat_var], df[cat_var], normalize=normalize) * 100
    modalities = ct.index.tolist()
    categories = ct.columns.tolist()
    n_mod = len(modalities)
    n_cat = len(categories)
    x = np.arange(n_mod)
    width = 0.35
    colors = ['#0F6E56', '#993C1D']  # teal = non-default, coral = default
    fig, ax = plt.subplots(figsize=(7.24, 4.07), dpi=100)
    for i, (cat, color) in enumerate(zip(categories, colors)):
        offset = (i - n_cat / 2 + 0.5) * width
        ax.bar(x + offset, ct[cat], width=width, color=color, label=str(cat))
        # Annotations above each bar
        for j, val in enumerate(ct[cat]):
            ax.text(x[j] + offset, val + 0.5, f"{val:.1f}%",
                    ha='center', va='bottom', fontsize=9, color='#444')
    # Clean style
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.yaxis.grid(True, color='#e0e0e0', linewidth=0.8, zorder=0)
    ax.set_axisbelow(True)
    ax.set_xticks(x)
    ax.set_xticklabels(modalities, fontsize=11)
    ax.set_ylabel("Rate (%)" if normalize else "Count", fontsize=11, color='#555')
    ax.tick_params(left=False, colors='#555')
    handles = [mpatches.Patch(color=c, label=str(l))
               for c, l in zip(colors, categories)]
    ax.legend(handles=handles, title=cat_var, frameon=False,
              fontsize=10, loc='upper right')
    ax.set_title(title, fontsize=13, fontweight='normal', pad=14)
    plt.tight_layout()
    plt.savefig("default_by_ownership.png", dpi=150, bbox_inches='tight')
    plt.show()

plot_grouped_bar(
    df=train_imputed,
    cat_var="def",
    subcat_var="person_home_ownership",
    normalize="index",
    title="Default Rate by Home Ownership"
)
1.1.4 Statistical Test: Analysis of the Link Between Default and Qualitative Explanatory Variables
The statistical test used is the chi-square test, which is a test of independence.
It aims to compare two variables in a contingency table to determine whether they are related. More generally, it assesses whether the distributions of categorical variables differ from one another.
A small chi-square statistic indicates that the observed data are close to the data expected under independence. In other words, there is no evidence of a relationship between the variables.
Conversely, a large chi-square statistic indicates a greater discrepancy between observed and expected frequencies, suggesting a possible relationship between the variables. If the p-value of the chi-square test is below 5%, we reject the null hypothesis of independence and conclude that the variables are dependent.
However, this test does not measure the strength of the relationship and is sensitive to both the sample size and the structure of the categories. That is why we turn to Cramér's V, which provides a more informative measure of association.
Cramér's V is derived from the chi-square independence test and quantifies the strength of the relationship between two qualitative variables X and Y.
The coefficient can be expressed as follows:
V = √(φ² / min(k − 1, r − 1)), with φ² = χ² / n
where:
- φ is the phi coefficient
- χ² is the statistic from Pearson's chi-squared test on the contingency table
- n is the total number of observations
- k is the number of columns of the contingency table
- r is the number of rows of the contingency table.
Cramér's V takes values between 0 and 1. Depending on its value, the strength of the association can be interpreted as follows:
- > 0.5 → High association
- 0.3 – 0.5 → Moderate association
- 0.1 – 0.3 → Low association
- 0 – 0.1 → Little to no association
For instance, we can consider that a variable is significantly associated with the target when Cramér's V exceeds a given threshold (0.5 or 50%), depending on the level of selectivity required for the analysis.
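To make the computation concrete, here is a minimal sketch that applies the formula above to a hypothetical contingency table (the counts are invented for illustration, not results from the article's dataset):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = home ownership, columns = default (0/1)
table = pd.DataFrame(
    {0: [500, 700, 120], 1: [200, 80, 60]},
    index=["RENT", "MORTGAGE", "OTHER"]
)

# Chi-square test of independence, then Cramér's V from the formula
chi2, p, dof, expected = chi2_contingency(table)
n = table.values.sum()
r, k = table.shape
cramers_v = np.sqrt((chi2 / n) / min(k - 1, r - 1))
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, Cramér's V = {cramers_v:.3f}")
```

Note that the p-value and Cramér's V answer different questions: the former tells us whether an association exists at all, the latter how strong it is.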
Graphical tools are commonly used to assess the discriminatory power of variables. They can also help evaluate the relationships between explanatory variables. This analysis aims to reduce the number of variables by identifying those that provide redundant information.
It is typically performed on variables of the same type (continuous variables with continuous variables, or categorical variables with categorical variables), since specific measures are designed for each case. For example, we can use Spearman correlation for continuous variables, or Cramér's V and Tschuprow's T for categorical variables to quantify the strength of association.
In the following section, we assume that the available variables are relevant for discriminating default. It therefore becomes appropriate to use statistical tests to further investigate the relationships between variables. We will describe a structured methodology for choosing the appropriate tests and provide clear justification for these choices.
The goal is not to cover every possible test, but rather to present a coherent and robust approach that can guide you in building a reliable scoring model.
1.2 Multicollinearity Between Variables
In credit scoring, when we talk about multicollinearity, the first thing that usually comes to mind is the Variance Inflation Factor (VIF). However, there is a much simpler approach that can be used when dealing with a large number of explanatory variables. This approach allows for an initial screening of relevant variables and helps reduce dimensionality by analyzing the relationships between variables of the same type.
In the following sections, we show how studying the relationships between continuous variables and between categorical variables can help identify redundant information and support the preselection of explanatory variables.
1.2.1 Test Statistic for the Study: Relationship Between Continuous Explanatory Variables
In scoring models, analyzing the relationship between two continuous variables is often used to pre-select variables and reduce dimensionality. This analysis becomes particularly relevant when the number of explanatory variables is very large (e.g., more than 100), as it can significantly reduce the number of variables.
In this section, we focus on the case of two continuous explanatory variables. In the next section, we will examine the case of two categorical variables.
To study this association, the Pearson correlation can be used. However, in most cases, the Spearman correlation is preferred, as it is a non-parametric measure. In contrast, Pearson correlation only captures linear relationships between variables.
Spearman correlation is often preferred in practice because it is robust to outliers and does not rely on distributional assumptions. It measures how well the relationship between two variables can be described by a monotonic function, whether linear or not.
Mathematically, it is computed by applying the Pearson correlation formula to the ranked variables: r_s = ρ(rank(X), rank(Y)).
Therefore, in this context, Spearman correlation is chosen to assess the relationship between two continuous variables.
If two or more independent continuous variables exhibit a high pairwise Spearman correlation (e.g., ≥ 0.6 or 60%), this suggests that they carry similar information. In such cases, it is appropriate to retain only one of them: either the variable that is most strongly correlated with the target (default) or the one considered most relevant based on domain expertise.
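The advantage of the rank-based measure is easy to see on a small synthetic example: for a perfectly monotonic but non-linear relationship, Spearman reaches 1 while Pearson does not (the data below are simulated, not taken from the article's dataset):

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.exp(x)  # strictly increasing, hence perfectly monotonic in x

rho_s, _ = spearmanr(x, y)  # rank-based: exactly 1 for a strictly monotonic link
rho_p, _ = pearsonr(x, y)   # linear measure: well below 1 here
print(f"Spearman = {rho_s:.3f}, Pearson = {rho_p:.3f}")
```

This is why a Pearson-only screening can miss redundant pairs when the relationship between two predictors is monotonic but curved.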
1.2.2 Test Statistic for the Study: Relationship Between Qualitative Explanatory Variables
As in the analysis of the relationship between an explanatory variable and the target (default), Cramér's V is used here to assess whether two or more qualitative variables provide the same information.
For instance, if Cramér's V exceeds 0.5 (50%), the variables are considered to be highly associated and may capture similar information. Therefore, they should not be included simultaneously in the model, as this would introduce redundancy.
The choice of which variable to retain can be based on statistical criteria, such as keeping the variable that is most strongly associated with the target (default), or on domain expertise, by selecting the variable considered the most relevant from a business perspective.
As you may have noticed, we do not study the relationship between a continuous variable and a categorical variable as part of the dimensionality reduction process, since there is no direct indicator to measure the strength of that association, unlike Spearman correlation or Cramér's V.
For those interested, one possible approach is to use the Variance Inflation Factor (VIF). We will cover this in a future publication. It is not discussed here because the methodology for computing VIF may differ depending on whether you use Python or R. These specific features will be addressed in the next post.
In the following section, we will apply everything discussed so far to real-world data, namely the dataset introduced in our previous article.
1.3 Application to Real Data
This section analyzes the correlations between variables and contributes to the pre-selection of variables. The data used are those from the previous article, where outliers and missing values were already treated.
Four types of correlations (each using a different statistical test seen above) are analyzed:
- Correlation between continuous variables and the default variable (Kruskal–Wallis test)
- Correlations between qualitative variables and the default variable (Cramér's V)
- Multi-correlations between continuous variables (Spearman correlation)
- Multi-correlations between qualitative variables (Cramér's V)
1.3.1 Correlation between continuous variables and the default variable
In the training database, we have seven continuous variables:
- person_income
- person_age
- person_emp_length
- loan_amnt
- loan_int_rate
- loan_percent_income
- cb_person_cred_hist_length
The table below presents the p-values from the Kruskal–Wallis test, which measures the relationship between these variables and the default variable.
from scipy.stats import kruskal

def correlation_quanti_def_KW(database: pd.DataFrame,
                              continuous_vars: list,
                              target: str) -> pd.DataFrame:
    """
    Compute Kruskal-Wallis test p-values between continuous variables
    and a categorical (binary or multi-class) target.

    Parameters
    ----------
    database : pd.DataFrame
        Input dataset
    continuous_vars : list
        List of continuous variable names
    target : str
        Target variable name (categorical)

    Returns
    -------
    pd.DataFrame
        Table with variables and corresponding p-values
    """
    results = []
    for var in continuous_vars:
        # Drop NA for current variable + target
        df = database[[var, target]].dropna()
        # Group values by target categories
        groups = [
            group[var].values
            for _, group in df.groupby(target)
        ]
        # Kruskal-Wallis requires at least 2 groups
        stat = None
        if len(groups) < 2:
            p_value = None
        else:
            try:
                stat, p_value = kruskal(*groups)
            except ValueError:
                # Handles edge cases (e.g., constant values)
                p_value = None
        results.append({
            "variable": var,
            "p_value": p_value,
            "stats_kw": stat
        })
    return pd.DataFrame(results).sort_values(by="p_value")

continuous_vars = [
    "person_income",
    "person_age",
    "person_emp_length",
    "loan_amnt",
    "loan_int_rate",
    "loan_percent_income",
    "cb_person_cred_hist_length"
]
target = "def"
result = correlation_quanti_def_KW(
    database=train_imputed,
    continuous_vars=continuous_vars,
    target=target
)
print(result)
# Save results to xlsx
result.to_excel(f"{data_output_path}/correlation/correlations_kw.xlsx", index=False)
By comparing the p-values to the 5% significance level, we observe that all are below the threshold. Therefore, we reject the null hypothesis for all variables and conclude that each continuous variable is significantly associated with the default variable.
1.3.2 Correlations between qualitative variables and the default variable (Cramér's V)
In the database, we have four qualitative variables:
- person_home_ownership
- cb_person_default_on_file
- loan_intent
- loan_grade
The table below reports the strength of the association between these categorical variables and the default variable, as measured by Cramér's V.
from scipy.stats import chi2_contingency

def cramers_v_with_target(database: pd.DataFrame,
                          categorical_vars: list,
                          target: str) -> pd.DataFrame:
    """
    Compute Chi-square statistic and Cramér's V between several
    categorical variables and a target variable.

    Parameters
    ----------
    database : pd.DataFrame
        Input dataset
    categorical_vars : list
        List of categorical variables
    target : str
        Target variable (categorical)

    Returns
    -------
    pd.DataFrame
        Table with variable, chi2 and Cramér's V
    """
    results = []
    for var in categorical_vars:
        # Drop missing values
        df = database[[var, target]].dropna()
        # Contingency table
        contingency_table = pd.crosstab(df[var], df[target])
        # Skip if not enough data
        if contingency_table.shape[0] < 2 or contingency_table.shape[1] < 2:
            results.append({
                "variable": var,
                "chi2": None,
                "cramers_v": None
            })
            continue
        try:
            chi2, _, _, _ = chi2_contingency(contingency_table)
            n = contingency_table.values.sum()
            r, k = contingency_table.shape
            v = np.sqrt((chi2 / n) / min(k - 1, r - 1))
        except Exception:
            chi2, v = None, None
        results.append({
            "variable": var,
            "chi2": chi2,
            "cramers_v": v
        })
    result_df = pd.DataFrame(results)
    # Optional: sort by importance
    return result_df.sort_values(by="cramers_v", ascending=False)

qualitative_vars = [
    "person_home_ownership",
    "cb_person_default_on_file",
    "loan_intent",
    "loan_grade",
]
result = cramers_v_with_target(
    database=train_imputed,
    categorical_vars=qualitative_vars,
    target=target
)
print(result)
# Save results to xlsx
result.to_excel(f"{data_output_path}/correlation/cramers_v.xlsx", index=False)
The results indicate that most variables are associated with the default variable. A moderate association is observed for loan_grade, while the other categorical variables exhibit weak associations.
1.3.3 Multi-correlations between continuous variables (Spearman correlation)
To identify continuous variables that provide similar information, we use the Spearman correlation with a threshold of 60%. That is, if two continuous explanatory variables exhibit a Spearman correlation above 60%, they are considered redundant and assumed to capture similar information.
def correlation_matrix_quanti(database: pd.DataFrame,
                              continuous_vars: list,
                              method: str = "spearman",
                              as_percent: bool = False) -> pd.DataFrame:
    """
    Compute correlation matrix for continuous variables.

    Parameters
    ----------
    database : pd.DataFrame
        Input dataset
    continuous_vars : list
        List of continuous variables
    method : str
        Correlation method ("pearson" or "spearman"), default = "spearman"
    as_percent : bool
        If True, return values in percentage

    Returns
    -------
    pd.DataFrame
        Correlation matrix
    """
    # Select relevant data and drop rows with NA
    df = database[continuous_vars].dropna()
    # Compute correlation matrix
    corr_matrix = df.corr(method=method)
    # Convert to percentage if required
    if as_percent:
        corr_matrix = corr_matrix * 100
    return corr_matrix

corr = correlation_matrix_quanti(
    database=train_imputed,
    continuous_vars=continuous_vars,
    method="spearman"
)
print(corr)
# Save results to xlsx
corr.to_excel(f"{data_output_path}/correlation/correlation_matrix_spearman.xlsx")
We identify two pairs of variables that are highly correlated:
- The pair (cb_person_cred_hist_length, person_age), with a correlation of 85%
- The pair (loan_percent_income, loan_amnt), with a high correlation
Only one variable from each pair should be retained for modeling. We rely on statistical criteria to select the variable that is most strongly associated with the default variable. In this case, we retain person_age and loan_percent_income.
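Rather than scanning the correlation matrix by eye, the redundant pairs can also be extracted programmatically. Here is a minimal sketch (the helper name and the toy matrix below are our own, not part of the article's pipeline):

```python
import pandas as pd

def high_corr_pairs(corr_matrix: pd.DataFrame, threshold: float = 0.6) -> list:
    """Return pairs of variables whose absolute correlation exceeds the threshold."""
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            value = corr_matrix.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 2)))
    return pairs

# Toy correlation matrix mimicking the structure of the results above
demo = pd.DataFrame(
    [[1.00, 0.85, 0.10],
     [0.85, 1.00, 0.05],
     [0.10, 0.05, 1.00]],
    index=["person_age", "cb_person_cred_hist_length", "loan_amnt"],
    columns=["person_age", "cb_person_cred_hist_length", "loan_amnt"],
)
print(high_corr_pairs(demo, threshold=0.6))
# [('person_age', 'cb_person_cred_hist_length', 0.85)]
```

The same helper can be applied to any square correlation matrix, including the Spearman matrix computed earlier.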
1.3.4 Multi-correlations between qualitative variables (Cramér's V)
In this section, we analyze the relationships between categorical variables. If two categorical variables are associated with a Cramér's V greater than 60%, one of them should be removed from the candidate risk driver list to avoid introducing highly correlated variables into the model.
The choice between the two variables can be based on expert judgment. However, in this case, we rely on a statistical approach and select the variable that is most strongly associated with the default variable.
The table below presents the Cramér's V matrix computed for each pair of categorical explanatory variables.
from scipy.stats import chi2_contingency

def cramers_v_matrix(database: pd.DataFrame,
                     categorical_vars: list,
                     corrected: bool = False,
                     as_percent: bool = False) -> pd.DataFrame:
    """
    Compute Cramér's V correlation matrix for categorical variables.

    Parameters
    ----------
    database : pd.DataFrame
        Input dataset
    categorical_vars : list
        List of categorical variables
    corrected : bool
        Apply bias correction (recommended)
    as_percent : bool
        Return values in percentage

    Returns
    -------
    pd.DataFrame
        Cramér's V matrix
    """
    def cramers_v(x, y):
        # Drop NA
        df = pd.DataFrame({"x": x, "y": y}).dropna()
        contingency_table = pd.crosstab(df["x"], df["y"])
        if contingency_table.shape[0] < 2 or contingency_table.shape[1] < 2:
            return np.nan
        chi2, _, _, _ = chi2_contingency(contingency_table)
        n = contingency_table.values.sum()
        r, k = contingency_table.shape
        phi2 = chi2 / n
        if corrected:
            # Bergsma correction
            phi2_corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
            r_corr = r - ((r - 1) ** 2) / (n - 1)
            k_corr = k - ((k - 1) ** 2) / (n - 1)
            denom = min(k_corr - 1, r_corr - 1)
        else:
            denom = min(k - 1, r - 1)
        if denom <= 0:
            return np.nan
        return np.sqrt(phi2_corr / denom) if corrected else np.sqrt(phi2 / denom)

    # Initialize matrix
    n = len(categorical_vars)
    matrix = pd.DataFrame(np.zeros((n, n)),
                          index=categorical_vars,
                          columns=categorical_vars)
    # Fill matrix (symmetric, so compute the upper triangle only)
    for i, var1 in enumerate(categorical_vars):
        for j, var2 in enumerate(categorical_vars):
            if i <= j:
                value = cramers_v(database[var1], database[var2])
                matrix.loc[var1, var2] = value
                matrix.loc[var2, var1] = value
    # Convert to percentage
    if as_percent:
        matrix = matrix * 100
    return matrix

matrix = cramers_v_matrix(
    database=train_imputed,
    categorical_vars=qualitative_vars,
)
print(matrix)
# Save results to xlsx
matrix.to_excel(f"{data_output_path}/correlation/cramers_v_matrix.xlsx")
From this desk, utilizing a 60% threshold, we observe that just one pair of variables is strongly related: (loan_grade, cb_person_default_on_file). The variable we retain is loan_grade, as it’s extra strongly related to the default variable.
Based on these analyses, we have pre-selected 9 variables for the next steps. Two variables were removed during the analysis of correlations between continuous variables, and one variable was removed during the analysis of correlations between categorical variables.
Conclusion
The objective of this post was to present how to measure the different relationships that exist between variables in a credit scoring model.
We have seen that this analysis can be used to evaluate the discriminatory power of explanatory variables, that is, their ability to predict the default variable. When the explanatory variable is continuous, we can rely on the non-parametric Kruskal–Wallis test to assess the relationship between the variable and default.
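As an illustration, a minimal Kruskal–Wallis check with SciPy might look like the following. The data here is synthetic (an interest-rate-like variable shifted upward for defaulters), not the article's dataset; a small p-value suggests the variable's distribution differs across default groups.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic example: the continuous variable is shifted upward for defaulters
rate_non_default = rng.normal(10.0, 2.0, 500)
rate_default = rng.normal(12.0, 2.0, 100)

stat, p_value = kruskal(rate_non_default, rate_default)
print(f"H = {stat:.2f}, p-value = {p_value:.4f}")
# A small p-value indicates the distributions differ,
# i.e., the variable helps discriminate default.
```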
When the explanatory variable is categorical, we use Cramér's V, which measures the strength of the association and is less sensitive to sample size than the chi-square test alone.
Finally, we have shown that analyzing relationships between variables also helps reduce dimensionality by identifying multicollinearity, especially when variables are of the same type.
For two continuous explanatory variables, we can use the Spearman correlation with a threshold (e.g., 60%). If the Spearman correlation exceeds this threshold, the two variables are considered redundant and should not both be included in the model. One can then be chosen based on its relationship with the default variable or based on domain expertise.
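A minimal sketch of this redundancy screen, assuming synthetic data and hypothetical column names (person_income, loan_amnt, person_age):

```python
import numpy as np
import pandas as pd

def redundant_pairs_spearman(df: pd.DataFrame, threshold: float = 0.60) -> list:
    """List pairs of continuous variables whose |Spearman rho| exceeds the threshold."""
    corr = df.corr(method="spearman").abs()
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

rng = np.random.default_rng(1)
income = rng.lognormal(10, 0.5, 300)
df = pd.DataFrame({
    "person_income": income,
    "loan_amnt": income * 0.3 + rng.normal(0, 500, 300),  # strongly tied to income
    "person_age": rng.integers(21, 65, 300).astype(float),  # independent of income
})
# Expect only (person_income, loan_amnt) to be flagged as redundant
print(redundant_pairs_spearman(df))
```

Only one of each flagged pair would then be kept, chosen by its link to default or by domain expertise.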
For two categorical explanatory variables, we again use Cramér's V. By setting a threshold (e.g., 50%), we can assume that if Cramér's V exceeds this value, the variables carry similar information. In this case, only one of the two variables should be retained, either based on its discriminatory power or by expert judgment.
In practice, we applied these methods to the dataset processed in our previous post. While this approach is effective, it is not the most robust method for variable selection. In our next post, we will present a more robust approach for pre-selecting variables in a scoring model.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
[1] Lorenzo Beretta and Alessandro Santaniello.
Nearest Neighbor Imputation Algorithms: A Critical Evaluation.
National Library of Medicine, 2016.
[2] Nexialog Consulting.
Traitement des données manquantes dans le milieu bancaire.
Working paper, 2022.
[3] John T. Hancock and Taghi M. Khoshgoftaar.
Survey on Categorical Data for Neural Networks.
Journal of Big Data, 7(28), 2020.
[4] Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, and Philip J. Leaf.
Multiple Imputation by Chained Equations: What Is It and How Does It Work?
International Journal of Methods in Psychiatric Research, 2011.
[5] Majid Sarmad.
Robust Data Analysis for Factorial Experimental Designs: Improved Methods and Software.
Department of Mathematical Sciences, University of Durham, England, 2006.
[6] Daniel J. Stekhoven and Peter Bühlmann.
MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.
Bioinformatics, 2011.
[7] Supriyanto Wibisono, Anwar, and Amin.
Multivariate Weather Anomaly Detection Using the DBSCAN Clustering Algorithm.
Journal of Physics: Conference Series, 2021.
[8] J. Laborda and S. Ryoo.
Feature Selection in a Credit Scoring Model.
Mathematics, 9(7), 746, 2021.
Data & Licensing
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author's responsibility. Feedback and corrections are welcome.