# Introduction
Production data is often bound by strict privacy and compliance regulations. As a result, anonymizing this data is essential in nearly every real-world data science initiative that involves deploying a data-driven product, service, or solution.
Mimesis is an open-source Python library known for its ability to quickly generate realistic synthetic data. It runs entirely on your local machine and offers a free, reliable solution for building data pipelines. This guide will walk you through how to use this library to anonymize sensitive production data, using a practical, step-by-step example that you can easily replicate in your IDE or a notebook.
# Step-by-Step Guide
If you’re just getting started with Mimesis, first install it in your Python environment:

pip install mimesis

If you’re working in a Google Colab notebook or a similar environment, prepend ! to the pip command.
Now we’re all set! Let’s imagine a scenario involving a software product’s subscription system with different tiers. For simplicity, we’ll create a small synthetic dataset containing customer information and their subscription levels. Some of the fields in this dataset contain highly sensitive data, as shown below:
import pandas as pd
# Building a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}
df = pd.DataFrame(production_data)
print("--- Original Sensitive Data ---")
print(df.head())

In our example, subscription tiers aren’t considered sensitive, but user names, emails, and phone numbers certainly are. Mimesis organizes its generators into providers: classes that each produce fake data for one domain, such as people, addresses, or dates. Since our records relate to individuals, we import the Person provider, which, given a locale such as English and a random seed, generates convincing replacements for real, sensitive personal information:
from mimesis import Person
from mimesis.locales import Locale
# Setting up a Person provider for the English locale, seeded for reproducibility
person = Person(locale=Locale.EN, seed=42)

From here on, anonymizing personally identifiable information (PII) is straightforward. All you need to do is replace the sensitive columns you choose with newly generated data from the Person provider. For each column, a list comprehension calls the appropriate Mimesis method once per row to create a realistic replacement value:
# 1. Swapping real names with realistic fake names
df['real_name'] = [person.full_name() for _ in range(len(df))]
# 2. Swapping real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]
# 3. Swapping real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]
# 4. Renaming the column to indicate it no longer holds real names
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

As shown above, the Person class includes dedicated methods for generating full names, email addresses, and telephone numbers, among others. The name column is also renamed to make it clear that the names in the updated dataset are anonymized rather than real.
We can now check the results by printing the modified DataFrame. The sensitive PII fields have been entirely replaced with plausible synthetic data, while the overall structure of the dataset and the columns needed for downstream analyses, such as subscription_tier, are unchanged.
print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())

Output:
--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone
0      101    Anthony Reilly    archived1911@duck.com     +13312271333
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676

  subscription_tier
0           Premium
1             Basic
2             Basic
3        Enterprise

Excellent! With just a few simple steps, we’ve anonymized several sensitive data fields commonly encountered in real-world, production-level data science projects and analyses, all at no cost, since Mimesis is open source.
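Eyeballing the printed output works for four rows, but a small programmatic check scales better. The sketch below is one possible approach, not part of Mimesis: leaked_values is a hypothetical helper that reports any original value still present in the corresponding anonymized column. It assumes you kept a snapshot of the sensitive columns before overwriting them, and it reuses a few values from the example above as stand-in data.

```python
import pandas as pd


def leaked_values(original, anonymized, column_pairs):
    """Map each original column to the real values that still appear after anonymization."""
    leaks = {}
    for original_col, anonymized_col in column_pairs.items():
        overlap = set(original[original_col]) & set(anonymized[anonymized_col])
        if overlap:
            leaks[original_col] = overlap
    return leaks


# Snapshot of the sensitive columns, taken BEFORE anonymizing
original = pd.DataFrame({
    "real_name": ["Alice Smith", "Bob Jones"],
    "email": ["alice.smith@corp.com", "bjones@startup.io"],
})

# Anonymized values copied from the example output
anonymized = pd.DataFrame({
    "anon_name": ["Anthony Reilly", "Kai Day"],
    "email": ["archived1911@duck.com", "suspect2087@yahoo.com"],
})

leaks = leaked_values(original, anonymized, {"real_name": "anon_name", "email": "email"})
print(leaks or "No original values found in the anonymized columns")
```

This only catches exact matches. It will not notice a real surname surviving inside a different string, so treat it as a sanity check rather than proof of anonymity.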
To wrap up, here are some best practices and key takeaways for carrying out the anonymization process we just demonstrated:
- We overwrote the columns directly in the DataFrame. Depending on your situation, consider whether this is the best approach, or whether you’d prefer to store the anonymized data in a separate DataFrame to avoid accidentally losing the original data.
- Mimesis generates values in the type and format each field is expected to hold, such as strings shaped like names, email addresses, or phone numbers.
- Using a seed helps maintain consistency in the generated data across multiple runs and supports reproducibility.
# Wrapping Up
In this article, we’ve demonstrated how to use Mimesis — a powerful Python library for generating anonymized and synthetic data — to convert a sensitive production dataset into a version that can be safely used for further analysis without exposing private information such as real people’s PII.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.



