# Introduction
Let’s face it: the clean, textbook version of data science rarely holds up in practice. We learn methods using perfectly curated, normally distributed data, but real-world projects throw us curveballs — extreme outliers, heavily skewed distributions, and wildly uneven variances.
In an earlier piece on building an exploratory data analysis (EDA) pipeline with Pingouin, we saw how statistical tests can flag when data breaks key assumptions like normality or homoscedasticity. But what happens when those tests fail? Discarding the data isn’t the answer — going robust is.
This article walks you through the art of applying robust statistics in your data science workflow. These are mathematical techniques specifically designed to produce trustworthy results even when your data violates standard assumptions or is riddled with outliers and noise. Using a “choose your own adventure” format, we’ll walk through three realistic scenarios using Python’s Pingouin library to tackle the messiest data challenges you’ll face on the job.
# Getting Started
First, let’s install (if you haven’t already) and import Pingouin and Pandas, then load the wine quality dataset available at the link below.
```python
!pip install pingouin pandas
```

```python
import pandas as pd
import pingouin as pg

# Loading our messy, real-world-like dataset with red and white wine samples
# (placeholder: substitute the wine quality dataset link from the article)
url = "<wine-quality-dataset-url>"
df = pd.read_csv(url)

# Taking a quick look at what we're working with
df.head()
```

If you read the previous Pingouin article, you’ll recall this is a notoriously messy dataset that falls short on several standard assumptions. Now we’ll dive into three separate “adventures,” each presenting a specific scenario, the core problem it poses, and a robust solution to handle it.
# Adventure 1: When the Normality Test Fails
Imagine we run normality tests on two groups: white wine samples and red wine samples.
```python
# Splitting alcohol content by wine type
white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']

print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))
```

You’ll discover that neither group follows a normal distribution, with extremely low p-values. While non-normality alone doesn’t confirm the presence of outliers or skewness, a strong departure from normality often hints that such issues lurk in the data. Running a t-test to compare means under these conditions would be risky and likely produce misleading results.
The robust solution here is the Mann-Whitney U test. Rather than comparing group averages, this test works with data ranks — essentially sorting all values from lowest to highest, such as arranging all wines by alcohol content. This rank-based strategy is the key trick that neutralizes the outsized influence of outliers. Here’s how to apply it:
```python
# Splitting into our two groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']

# Running the robust Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)
```

Output:

```
         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903
```

With a p-value above 0.05, we conclude there’s no statistically significant difference in alcohol content between the two wine types, and this conclusion is robust to the very outliers and skewness that made the t-test risky.
# Adventure 2: When the Paired T-Test Fails
Now suppose you need to compare two measurements taken from the same subject — say, a patient’s blood sugar before and after taking an experimental drug, or two chemical properties measured from the same bottle of wine. The critical question here is how the differences between paired measurements are distributed. When those differences aren’t normally distributed, a standard paired t-test will give you unreliable confidence intervals.
The go-to robust fix is the Wilcoxon Signed-Rank Test: the non-parametric counterpart to the paired t-test. It works by computing the differences between paired measurements and ranking their absolute values. In Pingouin, you simply call `pg.wilcoxon()` and pass in the two columns containing the paired measurements from the same subject — for example, two types of acidity in wine.
```python
# Running the robust Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
```

Result:

```
          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0  1.0
```

This result reveals a statistically significant difference — a “perfect separation” — between the two measurements. Not only are the two wine properties different, but they also exist on entirely different scales across the dataset.
# Adventure 3: When ANOVA Fails
In this third and final scenario, we want to determine whether residual sugar levels in wine vary significantly across different quality ratings — which range from 3 to 9 as whole numbers, making them suitable to treat as discrete categories.
If Pingouin’s Levene test for homoscedasticity fails badly — for instance, because sugar variance is enormous in low-quality wines but minimal in premium ones — a standard one-way ANOVA can produce misleading conclusions, since it assumes equal variances across all groups.
The remedy is Welch’s ANOVA, which drops the equal-variance assumption: it weights each group by how precisely its mean is estimated (down-weighting high-variance groups) and adjusts the degrees of freedom accordingly, enabling fairer comparisons across multiple categories. Here’s how to run this robust alternative to the classic ANOVA using Pingouin:
```python
# Running Welch's ANOVA to compare sugar levels across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
```

Result:

```
    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353
```

Even in situations where a standard ANOVA would struggle due to unequal variances, Welch’s ANOVA delivers a reliable conclusion. The extremely small p-value provides strong evidence that residual sugar levels do differ significantly across wine quality ratings. Keep in mind, though, that sugar is just one small piece of what determines wine quality — a fact reflected in the tiny partial eta-squared (`np2`) of roughly 0.008.
# Wrapping Up
Through three hands-on scenarios, each pairing a messy-data challenge with a robust statistical technique, we’ve seen that being a great data scientist isn’t about having pristine data or perfecting it — it’s about knowing how to respond when the data throws obstacles your way. Pingouin’s suite of functions implements a range of robust tests that help you sidestep the failed-assumptions trap and draw mathematically sound conclusions with minimal extra effort.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.



