5 Useful Python Scripts for Synthetic Data Generation

Image by Editor

# Introduction

Synthetic data, as the name suggests, is created artificially rather than being collected from real-world sources. It looks like real data but avoids privacy issues and high data collection costs. This lets you easily test software and models while running experiments to simulate performance after launch.

While libraries like Faker, SDV, and SynthCity exist, and even large language models (LLMs) are widely used for generating synthetic data, my focus in this article is to avoid relying on these external libraries or AI tools. Instead, you’ll learn how to achieve the same results by writing your own Python scripts. This provides a better understanding of how to shape a dataset and how biases or errors are introduced. We will start with simple toy scripts to understand the available options. Once you grasp these fundamentals, you can comfortably transition to specialized libraries.

# 1. Generating Simple Random Data

The simplest place to start is with a table. For example, if you need a fake customer dataset for an internal demo, you can run a script to generate comma-separated values (CSV) data:

import csv
import random
from datetime import datetime, timedelta

random.seed(42)  # fixed seed so every run produces the same data

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    # Pick a random day between the start and end dates (inclusive)
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    # Every field is drawn independently of the others
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")

This script is simple: you define fields, choose ranges, and write rows. The random module supports integer generation, floating-point values, random choice, and sampling. The csv module is designed to read and write row-based tabular data. This kind of dataset is suitable for:

Frontend demos
Dashboard testing
API development
Learning Structured Query Language (SQL)
Unit testing input pipelines

However, this approach has a major weakness: everything is completely random. This often results in data that looks flat or unnatural. Enterprise customers might spend only 2 dollars, while “Free” users might spend 400. Older users behave exactly like younger ones because there is no underlying structure.

In real-world scenarios, data rarely behaves this way. Instead of generating values independently, we can introduce relationships and rules. This makes the dataset feel more realistic while remaining fully synthetic. For instance:

Enterprise customers should almost never have zero spend
Spending ranges should depend on the selected plan
Older users might spend slightly more on average
Certain plans should be more common than others

Let’s add these controls to the script:

import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    # Weighted selection: 45% Free, 30% Basic, 18% Pro, 7% Enterprise
    roll = random.random()
    if roll < 0.45:
        return "Free"
    if roll < 0.75:
        return "Basic"
    if roll < 0.93:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    # The spend range depends on the plan
    if plan == "Free":
        base = random.uniform(0, 10)
    elif plan == "Basic":
        base = random.uniform(10, 60)
    elif plan == "Pro":
        base = random.uniform(50, 180)
    else:
        base = random.uniform(150, 500)

    # Users aged 40 and over spend 15% more
    if age >= 40:
        base *= 1.15

    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")

Now the dataset preserves meaningful patterns. Rather than generating random noise, you are simulating behaviors. Effective controls may include:

Weighted category selection
Realistic minimum and maximum ranges
Conditional logic between columns
Intentionally added rare edge cases
Missing values inserted at low rates
Correlated features instead of independent ones
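Several of these controls can be combined in one row generator. The following is a minimal sketch, not a definitive implementation: the 3% missing rate, the 1% edge-case rate, and the helper name `make_row` are assumptions chosen for illustration. It uses `random.choices` for weighted selection in place of the chained `if` checks.

```python
import random

rng = random.Random(42)  # local generator so the sketch is reproducible

PLANS = ["Free", "Basic", "Pro", "Enterprise"]
PLAN_WEIGHTS = [45, 30, 18, 7]   # same proportions as the choose_plan() thresholds
MISSING_RATE = 0.03              # assumption: about 3% of ages are left blank
EDGE_CASE_RATE = 0.01            # assumption: about 1% of rows are extreme spenders

SPEND_RANGES = {
    "Free": (0, 10),
    "Basic": (10, 60),
    "Pro": (50, 180),
    "Enterprise": (150, 500),
}

def make_row(i):
    # Weighted category selection in a single call
    plan = rng.choices(PLANS, weights=PLAN_WEIGHTS, k=1)[0]

    # Missing values inserted at a low rate
    age = None if rng.random() < MISSING_RATE else rng.randint(18, 70)

    # Conditional range per plan, plus an occasional rare edge case
    low, high = SPEND_RANGES[plan]
    spend = rng.uniform(low, high)
    if rng.random() < EDGE_CASE_RATE:
        spend *= 10

    return {
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": round(spend, 2),
    }

rows = [make_row(i) for i in range(1, 1001)]
```

When rows like these are written with `csv.DictWriter`, a `None` value becomes an empty field, which is a convenient way to represent a missing value in the CSV.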

# 2. Simulating Processes for Synthetic Data

Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of directly filling columns, you simulate a process. For example, consider a small warehouse where orders arrive, stock decreases, and low stock levels trigger backorders.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

# Starting stock level for each product
inventory = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in inventory:
        daily_orders = random.randint(0, 12)

        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = inventory[product]

            # Fulfill the order only if enough stock is available
            if inventory[product] >= qty:
                inventory[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"

            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": inventory[product],
                "status": status
            })

        # Restock after the day's orders when stock has run low
        if inventory[product] < 20:
            restock = random.randint(30, 80)
            inventory[product] += restock
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": restock,
                "stock_before": inventory[product] - restock,
                "stock_after": inventory[product],
                "status": "restock"
            })

    current_time += timedelta(days=1)

with open("warehouse_sim.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_sim.csv")

This method works well because the data is a byproduct of system behavior, which typically yields more realistic relationships than direct random row generation. Other simulation ideas include:

Call center queues
Ride requests and driver matching
Loan applications and approvals
Subscriptions and churn
Patient appointment flows
Website traffic and conversion
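To show how one of these ideas might look, here is a small sketch of a subscription and churn simulation. All of the rates and ranges (monthly signups, base churn, higher early churn) are assumptions picked for illustration, not figures from any real product.

```python
import random

rng = random.Random(42)

MONTHS = 12
NEW_PER_MONTH = (20, 40)   # assumption: range of monthly signups
BASE_CHURN = 0.05          # assumption: monthly churn probability
EARLY_CHURN = 0.15         # assumption: higher churn in the first 2 months

active = {}    # subscriber_id -> month of signup
rows = []
next_id = 1

for month in range(1, MONTHS + 1):
    # Existing subscribers may cancel before this month's signups arrive
    for sub_id, joined in list(active.items()):
        tenure = month - joined
        churn_p = EARLY_CHURN if tenure <= 2 else BASE_CHURN
        if rng.random() < churn_p:
            del active[sub_id]
            rows.append({"month": month, "subscriber_id": sub_id,
                         "event": "churn", "tenure_months": tenure})

    # New signups for the month
    for _ in range(rng.randint(*NEW_PER_MONTH)):
        active[next_id] = month
        rows.append({"month": month, "subscriber_id": next_id,
                     "event": "signup", "tenure_months": 0})
        next_id += 1

print(f"{len(rows)} events, {len(active)} active subscribers after {MONTHS} months")
```

Because churn depends on tenure, the resulting log contains a retention curve that a purely random "churned: yes/no" column would not have.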

# 3. Generating Time Series Synthetic Data

Synthetic data is not limited to static tables. Many systems produce sequences over time, such as app traffic, sensor readings, orders per hour, or server response times. Here is a simple time series generator for hourly website visits with weekday patterns.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30
rows = []

for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()

    # Lower baseline on weekends (Saturday = 5, Sunday = 6)
    base = 120
    if weekday >= 5:
        base = 80

    # Daily cycle: morning and evening peaks, quiet overnight
    hour = ts.hour
    if 8 <= hour <= 11:
        base += 60
    elif 18 <= hour <= 21:
        base += 40
    elif 0 <= hour <= 5:
        base -= 30

    # Gaussian noise around the baseline, clipped at zero
    visits = max(0, int(random.gauss(base, 15)))

    rows.append({
        "timestamp": ts.isoformat(),
        "visits": visits
    })

with open("traffic_timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "visits"])
    writer.writeheader()
    writer.writerows(rows)

print("Saved traffic_timeseries.csv")

This approach works well because it combines a baseline level, daily and weekly cycles, and random noise while remaining easy to explain and debug.
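A gradual long-term trend and occasional spikes can be layered onto the same structure. The sketch below is one possible way to do it; the growth rate, the spike rate, and the `is_spike` label column are assumptions added for illustration.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(42)

start = datetime(2026, 1, 1)
hours = 24 * 30
GROWTH_PER_DAY = 1.5   # assumption: baseline rises by 1.5 visits per day
SPIKE_RATE = 0.005     # assumption: roughly 1 hour in 200 is a spike

series = []
for i in range(hours):
    ts = start + timedelta(hours=i)

    # Weekly cycle, as in the script above
    base = 80 if ts.weekday() >= 5 else 120

    # Linear upward trend measured in days since the start
    trend = GROWTH_PER_DAY * (i / 24)

    value = rng.gauss(base + trend, 15)

    # Rare spikes, labeled so they can serve as ground truth later
    is_spike = rng.random() < SPIKE_RATE
    if is_spike:
        value *= 3

    series.append({
        "timestamp": ts.isoformat(),
        "visits": max(0, int(value)),
        "is_spike": is_spike,
    })
```

Keeping the `is_spike` flag in the output gives you labels for anomaly detection experiments without any manual annotation.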

# 4. Creating Event Logs

Event logs are another useful script style, ideal for product analytics and workflow testing. Instead of one row per customer, you create one row per action.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    # Each user starts on a random day within the first 10 days
    current_time = start + timedelta(days=random.randint(0, 10))

    for _ in range(event_count):
        event = random.choice(events)

        # About 60% of purchase events carry a value; all other rows are 0.0
        if event == "purchase" and random.random() < 0.6:
            value = round(random.uniform(10, 300), 2)
        else:
            value = 0.0

        rows.append({
            "user_id": f"USER{user_id:04d}",
            "event_time": current_time.isoformat(),
            "event_name": event,
            "event_value": value
        })

        # Advance the clock by 1 to 180 minutes between events
        current_time += timedelta(minutes=random.randint(1, 180))

with open("event_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved event_log.csv")

This format is useful for:

Funnel analysis
Analytics pipeline testing
Business intelligence (BI) dashboards
Session reconstruction
Anomaly detection experiments

A useful technique here is to make events dependent on earlier actions. For example, a purchase should typically follow a login or a page view, making the synthetic log more believable.
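One way to express that dependency is a small transition table, where the next event is drawn based on the current one. The sketch below is an illustration under assumptions: the `TRANSITIONS` table, its weights, and the `simulate_session` helper are hypothetical and not part of the script above.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(42)

# Hypothetical transition table: current event -> (possible next events, weights)
TRANSITIONS = {
    "login":       (["view_page", "logout"], [0.9, 0.1]),
    "view_page":   (["view_page", "add_to_cart", "logout"], [0.5, 0.3, 0.2]),
    "add_to_cart": (["view_page", "purchase", "logout"], [0.3, 0.5, 0.2]),
    "purchase":    (["view_page", "logout"], [0.3, 0.7]),
}

def simulate_session(user_id, start_time, max_events=30):
    """Generate one session that starts with login and ends with logout."""
    rows = []
    event, ts = "login", start_time
    while True:
        value = round(rng.uniform(10, 300), 2) if event == "purchase" else 0.0
        rows.append({"user_id": user_id, "event_time": ts.isoformat(),
                     "event_name": event, "event_value": value})
        if event == "logout":
            break
        if len(rows) >= max_events - 1:
            event = "logout"   # force the session to end at max_events rows
        else:
            options, weights = TRANSITIONS[event]
            event = rng.choices(options, weights=weights, k=1)[0]
        ts += timedelta(minutes=rng.randint(1, 20))
    return rows

start = datetime(2026, 1, 1)
sessions = [
    simulate_session(f"USER{uid:04d}", start + timedelta(days=rng.randint(0, 10)))
    for uid in range(1, 201)
]
rows = [row for session in sessions for row in session]
```

With this table, a purchase can only ever follow `add_to_cart`, so funnel analysis on the output shows a realistic drop-off at each step.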

# 5. Generating Synthetic Text Data with Templates

Synthetic data is also valuable for natural language processing (NLP). You don’t always need an LLM to start; you can build effective text datasets using templates and controlled variation. For example, you can create support ticket training data:

import json
import random

random.seed(42)

# (label, message) pairs: the label is the classification target
issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

records = []

for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)

    text = f"{tone}. {message}."
    records.append({
        "text": text,
        "label": label
    })

# JSON Lines format: one JSON object per line
with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")

This approach works well for:

Text classification demos
Intent detection
Chatbot testing
Prompt evaluation
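Four messages and four tones give only 16 distinct texts, so a model can simply memorize them. Templates with fillable slots are one way to add controlled variation. The following is a rough sketch; the templates, slot values, and the 30% lowercasing rate are all hypothetical choices for illustration.

```python
import json
import random

rng = random.Random(42)

# Hypothetical templates, grouped by label, with {slot} placeholders
TEMPLATES = {
    "billing": [
        "I was charged {amount} twice for my {plan} subscription",
        "My invoice for the {plan} plan shows {amount} but I expected less",
    ],
    "shipping": [
        "My {item} order has not arrived after {days} days",
        "Tracking for my {item} has not updated in {days} days",
    ],
}

# Hypothetical slot values
SLOTS = {
    "amount": ["$9.99", "$25", "$120"],
    "plan": ["Basic", "Pro", "Enterprise"],
    "item": ["keyboard", "headphones", "laptop stand"],
    "days": ["3", "7", "14"],
}

def make_ticket():
    label = rng.choice(list(TEMPLATES))
    template = rng.choice(TEMPLATES[label])
    # str.format ignores unused keyword arguments, so every slot can be passed
    values = {name: rng.choice(options) for name, options in SLOTS.items()}
    text = template.format(**values)
    if rng.random() < 0.3:   # assumption: 30% of tickets are typed in lowercase
        text = text.lower()
    return {"text": text, "label": label}

tickets = [make_ticket() for _ in range(100)]
print(json.dumps(tickets[0]))
```

Each extra template or slot value multiplies the number of distinct texts, while the label stays tied to the template that produced it.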

# Final Thoughts

Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Be sure to avoid these common mistakes:

Making all values uniformly random
Forgetting dependencies between fields
Generating values that violate business logic
Assuming synthetic data is inherently safe by default
Creating data that is too “clean” to be useful for testing real-world edge cases
Using the same pattern so frequently that the dataset becomes predictable and unrealistic
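A quick way to catch some of these mistakes is to validate the generated rows before using them. The sketch below checks rows shaped like the controlled customer dataset from section 1. The `validate_rows` helper and the `SPEND_LIMITS` table are illustrative assumptions; the upper limits allow for the 15% uplift applied to users aged 40 and over.

```python
# Allowed spend per plan: base range, with the upper bound multiplied by 1.15
SPEND_LIMITS = {
    "Free": (0, 11.5),
    "Basic": (10, 69),
    "Pro": (50, 207),
    "Enterprise": (150, 575),
}

def validate_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    for row in rows:
        cid = row["customer_id"]

        # IDs must be unique
        if cid in seen_ids:
            problems.append(f"{cid}: duplicate customer_id")
        seen_ids.add(cid)

        # Ages must stay inside the generated range
        if not 18 <= row["age"] <= 70:
            problems.append(f"{cid}: age {row['age']} outside 18-70")

        # Spend must match the plan's range
        limits = SPEND_LIMITS.get(row["plan"])
        if limits is None:
            problems.append(f"{cid}: unknown plan {row['plan']!r}")
        elif not limits[0] <= row["monthly_spend"] <= limits[1]:
            problems.append(
                f"{cid}: spend {row['monthly_spend']} outside {limits} for {row['plan']}"
            )
    return problems

sample = [
    {"customer_id": "CUST00001", "age": 35, "plan": "Pro", "monthly_spend": 99.0},
    {"customer_id": "CUST00002", "age": 52, "plan": "Enterprise", "monthly_spend": 2.0},
]
for problem in validate_rows(sample):
    print(problem)
```

Here the second row is flagged because an Enterprise customer spending 2 dollars violates the plan's range, which is exactly the kind of row the fully random script in section 1 can produce.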

Privacy remains the most critical consideration. While synthetic data reduces exposure to real records, it is not risk-free. If a generator is too closely tied to original sensitive data, leakage can still occur. This is why privacy-preserving techniques, such as differentially private synthetic data, are important.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
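To make the idea concrete, here is a toy sketch of differentially private generation for a single categorical column, using Laplace noise on category counts. It is an illustration under stated assumptions (a fixed, public category list, a privacy budget of `EPSILON = 1.0`, and stand-in data), not a complete privacy guarantee for a full dataset.

```python
import random

rng = random.Random(42)

PLANS = ["Free", "Basic", "Pro", "Enterprise"]   # fixed, public category list
EPSILON = 1.0                                    # assumption: privacy budget

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials follows a Laplace distribution
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_plan_weights(real_plans, epsilon=EPSILON):
    # Adding or removing one person changes one count by 1 (sensitivity 1),
    # so Laplace noise with scale 1/epsilon is added to each count
    noisy = [real_plans.count(p) + laplace_noise(1 / epsilon) for p in PLANS]
    # Clamp negatives; random.choices raises ValueError if all weights are zero
    return [max(0.0, n) for n in noisy]

# Stand-in for a sensitive column; in practice this would be real data
real_plans = ["Free"] * 450 + ["Basic"] * 300 + ["Pro"] * 180 + ["Enterprise"] * 70

weights = private_plan_weights(real_plans)
synthetic_plans = rng.choices(PLANS, weights=weights, k=1000)
```

The synthetic column is sampled only from the noisy counts, never from the real rows, so the influence of any single real record on the output is limited by the noise.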
