"Revolutionizing Data Analysis: Crafting Next-Gen EDA Workflows With Pingouin"

# Getting Started

If you’ve worked in data science for a while, you’ve likely encountered a core truth: the quality of your machine learning results depends entirely on the quality of your input data. This principle is famously summarized as garbage in, garbage out (GIGO).

For instance, using highly correlated features in a linear regression, or applying ANOVA tests to data with unequal variances, sets you up for unreliable models that fail to capture meaningful patterns.

While standard exploratory data analysis (EDA) tools like graphs and charts are useful, they often fall short when you need to rigorously verify that your data meets the mathematical requirements for further analysis or modeling. Pingouin fills this gap by seamlessly combining the strengths of popular data science libraries like SciPy and pandas. It’s also a powerful tool for creating robust, automated data exploration workflows. This guide walks you through building a comprehensive pipeline for performing statistically sound data analysis and validating key data characteristics.

# Setting Up Your Environment

Begin by installing Pingouin in your Python environment (along with pandas if needed):

!pip install pingouin pandas

Next, import these libraries and load a dataset. We’ll use a public dataset containing wine characteristics and quality ratings.

import pandas as pd
import pingouin as pg

# Loading wine dataset from a public GitHub repo
url = "
df = pd.read_csv(url)

# Preview the first few rows to get familiar with the features
df.head()

# Assessing Single-Variable Normality

Our first specific check focuses on whether individual features follow a normal distribution. Many common machine learning algorithms—as well as statistical methods like t-tests and ANOVA—assume that continuous variables are normally distributed (also called Gaussian). You can easily test this using Pingouin’s pg.normality() function, which runs a Shapiro-Wilk test on each column of your DataFrame:

# Choosing continuous features to test for normality
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']

# Performing the normality check
normality_results = pg.normality(df[features])
print(normality_results)

Sample output:

                         W          pval  normal
fixed acidity     0.879789  2.437973e-57   False
volatile acidity  0.875867  6.255995e-58   False
citric acid       0.964977  5.262332e-37   False
pH                0.991448  2.204049e-19   False
alcohol           0.953532  2.918847e-41   False

It appears that none of our numeric features meet the normality assumption. This isn’t necessarily an error in the data—it’s just part of its natural variation. What it does tell us is that, during later preprocessing steps (outside of our current analysis), we may want to apply transformations such as log or Box-Cox to make the data more normal-like, especially if our final modeling approach requires normally distributed inputs.

# Evaluating Multi-Variable Normality

Beyond checking each variable individually, it’s also valuable to assess overall multivariate normality, which considers how variables interact together. This is particularly important for methods like MANOVA (multivariate ANOVA), where the joint distribution must be normal.

# Henze-Zirkler test for multivariate normality
multivariate_normality_results = pg.multivariate_normality(df[features])
print(multivariate_normality_results)

Sample output:

HZResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)

As shown, you’ll likely see results like: HZResults(...normal=False), indicating that multivariate normality isn’t satisfied either. If you plan to build predictive models with this data, consider using non-parametric methods such as gradient boosting or random forests rather than parametric approaches like linear regression or SVMs, which rely heavily on distributional assumptions.

# Testing for Equal Variances

The next concept—homoscedasticity—might sound complex, but it simply means having consistent variance across prediction errors, acting as a sign of stability. We’ll verify this using Levene’s test via Pingouin:

# Levene’s test for homogeneity of variances
# 'dv' = outcome variable, 'group' = categorical grouping
homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
print(homoscedasticity_results)

Output example:

                W          pval  equal_var
levene  66.338684  2.317649e-80      False

Again, we get False, meaning the condition of equal variance (homoscedasticity) isn’t met—this is known as heteroscedasticity. You’ll need to address this in downstream modeling, for example by using robust standard errors when fitting regression models.

# Verifying Sphericity

Another important statistical property is sphericity, which confirms whether the variances of differences between all possible pairs of related conditions are roughly equal. This check is essential before applying techniques like Principal Component Analysis (PCA) for dimensionality reduction, since PCA relies on meaningful correlations among variables; if correlations are absent, PCA won’t add value.

# Mauchly’s test for sphericity
sphericity_results = pg.sphericity(df[features])
print(sphericity_results)

Output example:

SpherResults(spher=False, W=np.float64(0.004437706589942777),


It appears we've selected a particularly tough, dry dataset! Don't worry — this article is deliberately structured to emphasize the EDA (Exploratory Data Analysis) process and guide you in spotting numerous data issues like these. Ultimately, catching them early and understanding how to address them before moving on to downstream machine learning tasks is far more valuable than constructing a model that could be flawed from the start. In this instance, there's a twist: we have a p-value of 0.0, which means the null hypothesis of an identity correlation matrix is rejected — in other words, meaningful correlations do exist among the variables. So if we had a large number of features and wanted to reduce dimensionality, applying PCA (Principal Component Analysis) could be a smart move.

Checking Multicollinearity

Finally, let's examine multicollinearity — a condition that reveals whether any predictors are strongly correlated with one another. This can become a problematic issue in interpretable models such as linear regressors. Here's how we check it:


# Computing a robust correlation matrix with p-values
correlation_matrix = pg.rcorr(df[features], method='pearson')
print(correlation_matrix)
Output matrix:
                 fixed acidity volatile acidity citric acid     pH alcohol
fixed acidity                -              ***         ***    ***     ***
volatile acidity         0.219                -         ***    ***      **
citric acid              0.324           -0.378           -    ***        
pH                      -0.253            0.261       -0.33      -     ***
alcohol                 -0.095           -0.038       -0.01  0.121       -
While pandas' corr() method can also accomplish this, Pingouin's version uses asterisks to denote the statistical significance of each correlation (* for p < 0.05, ** for p < 0.01, and *** for p < 0.001). A correlation may be statistically significant yet still weak in magnitude — multicollinearity only becomes a real concern when the absolute correlation value is high (generally above 0.8). In our case, none of the pairwise correlations are alarmingly large, meaning all five evaluated features contribute mostly independent, unique information for subsequent analyses.
Wrapping Up
Through a step-by-step series of practical examples, we've explored how to harness the power of Pingouin — an open-source Python library — to build robust, modern EDA pipelines. These pipelines empower you to make smarter decisions in data preprocessing and downstream analyses, whether you're leveraging advanced statistical tests or machine learning models, ultimately helping you choose the right steps to take and the right models to apply.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.

Top Posts

Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes

Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility

Code-Wielding Catalyst: Forge an LLM Agent That Architects and Executes Software

“Revolutionizing Data Analysis: Crafting Next-Gen EDA Workflows with Pingouin”

`Charting the Vessel Storm: A Proteomic Blueprint for Vasculitis Remission`

`Beyond Guesswork: A Slurm-Powered Battle Plan for Benchmarking Distributed LLM Servers`

`Beyond Prompt Engineering: How 4 Context Bricks Silence RAG Hallucinations`

`Run Mythos Enhanced Coding Model Locally with llama.cpp on Raspberry Pi`

`Astryx: Meta’s Open-Source React Toolkit—150+ Accessible Components, 7 Themes, and a CLI Agent-Ready Design System`

`Endless Code: Mastering the Art of the 24-Hour Claude Agent`

`Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes`

`Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility`

`Code-Wielding Catalyst: Forge an LLM Agent That Architects and Executes Software`

`Cisco’s Budget AI Guardians: Revolutionizing Source Code Security`

`Proof Of Node Doesn’t Change Bitcoin: The Immutable Truth`

`Charting the Vessel Storm: A Proteomic Blueprint for Vasculitis Remission`

`Migrate Your On-Prem ERP to Dynamics 365: A Cloud Transformation Journey`

`Supercharging Smart Homes: The Fibre Internet Revolution Behind IoT Awakening`

`Trending`

`Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes`

`Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility`

`Latest Posts`

`Not More Data, but Better World Models – Unite.AI`

`OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears`

Subscribe to Updates

Top Posts

“Revolutionizing Data Analysis: Crafting Next-Gen EDA Workflows with Pingouin”

# Getting Started

# Setting Up Your Environment

# Assessing Single-Variable Normality

# Evaluating Multi-Variable Normality

# Testing for Equal Variances

# Verifying Sphericity

Checking Multicollinearity

Wrapping Up

Related Posts

`Related Posts`