"Masking Real-World Data: Secure Anonymization Techniques For Data Science Using Mimesis"

# Introduction

Production data is often bound by strict privacy and compliance regulations. As a result, anonymizing this data is essential in nearly every real-world data science initiative that involves deploying a data-driven product, service, or solution.

Mimesis is an open-source Python library known for its ability to quickly generate realistic synthetic data. It runs entirely on your local machine and offers a free, reliable solution for building data pipelines. This guide will walk you through how to use this library to anonymize sensitive production data, using a practical, step-by-step example that you can easily replicate in your IDE or a notebook.

# Step-by-Step Guide

If you’re just getting started with Mimesis, first install it in your Python environment:

pip install mimesis

Prepend ! to the pip command if you’re working in a Google Colab notebook or a similar environment.

Now we’re all set! Let’s imagine a scenario involving a software product’s subscription system with different tiers. For simplicity, we’ll create a small synthetic dataset containing customer information and their subscription levels. Some of the fields in this dataset contain highly sensitive data, as shown below:

import pandas as pd

# Building a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}

df = pd.DataFrame(production_data)
print("--- Original Sensitive Data ---")
print(df.head())

In our example, subscription tiers aren’t considered sensitive, but user names, emails, and phone numbers certainly are. Mimesis organizes its generators into providers: classes that each produce one kind of data. Since our records describe individuals, we import the Person provider, which, given a locale such as English and a random seed, generates convincing replacements for real personal information:

from mimesis import Person
from mimesis.locales import Locale

# Setting up a Person provider for the English locale, with a fixed seed
person = Person(locale=Locale.EN, seed=42)

From here on, anonymizing personally identifiable information (PII) is straightforward: replace each sensitive column you choose with values generated by the Person provider. The list comprehensions below call the matching Mimesis method once per row, producing a realistic replacement for every record:

# 1. Swapping real names with realistic fake names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Swapping real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Swapping real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to indicate it no longer holds real names
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

As shown above, Mimesis’ Person class includes dedicated methods for generating full names, email addresses, and telephone numbers, among other things. Additionally, the name column is renamed to make it clear that the names in the updated dataset are anonymized rather than real.

We now check the results by examining the modified DataFrame. The sensitive PII fields have been entirely replaced — they now contain plausible synthetic data, while the overall structure of the dataset and key information needed for downstream analyses, such as subscription_tier, remain fully preserved.

print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())

Output:

--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone  
0      101    Anthony Reilly    archived1911@duck.com     +13312271333   
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586   
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988   
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676   

  subscription_tier  
0           Premium  
1             Basic  
2             Basic  
3        Enterprise
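
Eyeballing the output works for four rows, but a programmatic check scales better. The sketch below assumes you kept a copy of the original data before overwriting it; the frames are small hypothetical stand-ins that mirror the example above, and `leaked_values` is an illustrative helper, not part of Mimesis or pandas:

```python
import pandas as pd

# Hypothetical "before" and "after" frames, mirroring the article's example.
original = pd.DataFrame({
    'user_id': [101, 102],
    'real_name': ['Alice Smith', 'Bob Jones'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io'],
    'subscription_tier': ['Premium', 'Basic'],
})
anonymized = pd.DataFrame({
    'user_id': [101, 102],
    'anon_name': ['Anthony Reilly', 'Kai Day'],
    'email': ['archived1911@duck.com', 'suspect2087@yahoo.com'],
    'subscription_tier': ['Premium', 'Basic'],
})


def leaked_values(original, anonymized, column_pairs):
    """Return original values that still appear in the anonymized columns."""
    leaks = {}
    for source_col, target_col in column_pairs.items():
        overlap = set(original[source_col]) & set(anonymized[target_col])
        if overlap:
            leaks[source_col] = sorted(overlap)
    return leaks


# Map each sensitive source column to its (possibly renamed) counterpart.
pairs = {'real_name': 'anon_name', 'email': 'email'}
leaks = leaked_values(original, anonymized, pairs)
print(leaks)  # an empty dict means no original value survived

# Non-sensitive columns should come through unchanged.
tier_preserved = original['subscription_tier'].equals(
    anonymized['subscription_tier']
)
print(tier_preserved)
```

An empty result only shows that no original value was copied over verbatim; it says nothing about whether the remaining columns could identify someone in combination.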

Excellent! With just a few simple steps, we’ve anonymized several sensitive data fields commonly encountered in real-world, production-level data science projects and analyses — all at no cost, thanks to Mimesis being open-source.

To wrap up, here are some best practices and key takeaways for carrying out the anonymization process we just demonstrated:

- We overwrote the columns directly in the DataFrame. Depending on your situation, consider whether this is the best approach, or whether you’d prefer to store the anonymized data in a separate DataFrame to avoid accidentally losing the original data.
- Mimesis generates values in the format expected for each field (names look like names, emails look like emails), so each replaced column keeps its expected data type.
- Using a seed makes the generated data reproducible: the same seed and the same sequence of calls yield the same values across runs.
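
The first point can be sketched as a small helper that works on a copy and leaves the input untouched. The function name `anonymize_copy` and its parameters are hypothetical, and a simple counter-based generator stands in for Mimesis so the block runs on its own:

```python
import itertools

import pandas as pd


def anonymize_copy(df, generators, renames=None):
    """Return an anonymized copy of df; the input frame is left untouched.

    generators maps a column name to a zero-argument callable that
    produces one replacement value per call.
    """
    result = df.copy()
    for column, make_value in generators.items():
        result[column] = [make_value() for _ in range(len(result))]
    return result.rename(columns=renames or {})


# Stand-in generator so the sketch runs without Mimesis installed.
counter = itertools.count(1)


def fake_name():
    return f"User {next(counter)}"


source = pd.DataFrame({
    'user_id': [101, 102],
    'real_name': ['Alice Smith', 'Bob Jones'],
    'subscription_tier': ['Premium', 'Basic'],
})
safe = anonymize_copy(
    source,
    {'real_name': fake_name},
    renames={'real_name': 'anon_name'},
)

print(source['real_name'].tolist())  # ['Alice Smith', 'Bob Jones']
print(safe['anon_name'].tolist())    # ['User 1', 'User 2']
```

With a seeded Person provider, the mapping could instead pass bound methods, for example `{'real_name': person.full_name, 'email': person.email, 'phone': person.telephone}`.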

# Wrapping Up

In this article, we’ve demonstrated how to use Mimesis, a Python library for generating synthetic data, to convert a sensitive production dataset into a version that can be used for further analysis without exposing real people’s PII.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
