# Introduction
Data analytics has changed in recent years. The traditional approach of loading everything into a relational database and running SQL queries still works, but it’s often overkill for many analytical workloads. Storing data in Parquet files and querying them directly with DuckDB is frequently faster, simpler, and easier to maintain.
In this article, I’ll show you how to build a data analytics stack in Python that uses DuckDB to query data stored in Parquet files. We’ll work with a sample dataset, explore how each component works, and understand why this approach can be useful in your data science projects.
You can find the code on GitHub.
# Prerequisites
Before we start, make sure you have:
- Python 3.10 or a later version installed
- An understanding of SQL fundamentals and pandas DataFrame operations
- Familiarity with data analysis concepts
Also install the required libraries:
pip install duckdb pandas pyarrow numpy faker
# Understanding the Recommended Data Analytics Stack
Let’s start by understanding what each component does and why they work well together.
Parquet is a columnar storage format originally created for the Hadoop ecosystem. Unlike row-based formats like CSV, where each line is a complete record, Parquet organizes data by columns. This might seem like a small difference, but it has big implications for analytics.
When you run a query that only needs three columns from a table with fifty columns, Parquet lets you read only those three columns. With CSV, you have to read every row completely and then throw away the 47 columns you don’t need. This makes Parquet much faster for typical analytical queries. Additionally, columnar storage compresses well because values in the same column tend to be similar.
DuckDB is an embedded analytical database. While SQLite is optimized for transactional workloads with many small reads and writes, DuckDB is designed specifically for analytical queries that involve scanning large amounts of data, aggregations, and joins. The embedded part means it runs inside your Python process, so there is no separate database server to install or manage.
What makes DuckDB special for analytics is that it can query Parquet files directly. You don’t need to import data into the database first. Point DuckDB at a Parquet file, write SQL, and it reads only what it needs. This “query in place” capability is what makes the whole stack useful.
You can use this in your Python development environment. You store data in Parquet files, pandas handles data manipulation, DuckDB executes analytical queries, and the entire Python ecosystem is available for visualization, machine learning, and automation.
# Creating a Sample Dataset
We’ll use an e-commerce dataset. You can use the data_generator.py script to generate the sample dataset or follow along in this notebook.
The dataset consists of customers who place orders, orders that contain multiple items, and products with categories and pricing.
The data has referential integrity. Every order references a valid customer, and every order item references both a valid order and a valid product. This lets us perform meaningful joins and aggregations.
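The data_generator.py script itself isn’t reproduced here, but if you want a self-contained way to follow along, the sketch below builds four DataFrames with the same shape the later queries assume (the column names, segment labels, and row counts are my stand-ins, not the script’s exact output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers, n_products, n_orders = 1000, 100, 5000

customers_df = pd.DataFrame({
    "customer_id": np.arange(1, n_customers + 1),
    "customer_segment": rng.choice(
        ["Standard", "Basic", "Premium"], size=n_customers, p=[0.5, 0.3, 0.2]),
})

categories = ["Electronics", "Clothing", "Books", "Sports", "Automotive"]
base_price = rng.uniform(5, 2000, n_products).round(2)
products_df = pd.DataFrame({
    "product_id": np.arange(1, n_products + 1),
    "product_name": [f"Product {i}" for i in range(1, n_products + 1)],
    "category": rng.choice(categories, size=n_products),
    "base_price": base_price,
    "cost": (base_price * 0.6).round(2),
})

orders_df = pd.DataFrame({
    "order_id": np.arange(1, n_orders + 1),
    "customer_id": rng.integers(1, n_customers + 1, n_orders),
    "order_date": pd.to_datetime("2025-01-01")
        + pd.to_timedelta(rng.integers(0, 365, n_orders), unit="D"),
    "payment_status": rng.choice(
        ["completed", "pending", "refunded"], size=n_orders, p=[0.85, 0.1, 0.05]),
})

# Every order gets 1-4 line items referencing valid orders and products,
# which is what gives the dataset its referential integrity
item_orders = orders_df["order_id"].repeat(
    rng.integers(1, 5, n_orders)).reset_index(drop=True)
order_items_df = pd.DataFrame({
    "order_id": item_orders,
    "product_id": rng.integers(1, n_products + 1, len(item_orders)),
    "quantity": rng.integers(1, 4, len(item_orders)),
})
order_items_df = order_items_df.merge(
    products_df[["product_id", "base_price"]], on="product_id")
order_items_df["unit_price"] = order_items_df["base_price"]
order_items_df["item_total"] = (
    order_items_df["unit_price"] * order_items_df["quantity"]).round(2)
order_items_df = order_items_df.drop(columns="base_price")

# Derive order totals from the line items so the tables stay consistent
orders_df["order_total"] = (
    order_items_df.groupby("order_id")["item_total"].sum()
    .reindex(orders_df["order_id"]).to_numpy())
```

The exact prices and counts don’t matter; what matters is that every foreign key resolves, so the joins later in the article behave the same way.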
# Saving Data to Parquet Files
Before we save our data, let’s understand why Parquet is effective for analytics. We’ve already discussed the advantages of columnar storage formats like Parquet, but let’s go over it again, this time in greater detail.
In a CSV file, data is stored row by row. If you have a million rows with 50 columns each, and you want to analyze just one column, you still have to read all 50 million values to skip past the columns you don’t need. That is wasteful.
Parquet, as we now know, stores data column by column. All values for one column are stored together. When you query one column, you read exactly that column and nothing else. For analytical queries that typically touch a small number of columns, this is much faster.
Columnar storage also compresses better. Values in the same column tend to be similar: they are usually all integers, all dates, or all from the same categorical set. Compression algorithms work much better on similar data than on random data.
Let’s save our data as Parquet and see the benefits:
# Save tables as Parquet files
customers_df.to_parquet('customers.parquet', engine='pyarrow', compression='snappy')
products_df.to_parquet('products.parquet', engine='pyarrow', compression='snappy')
orders_df.to_parquet('orders.parquet', engine='pyarrow', compression='snappy')
order_items_df.to_parquet('order_items.parquet', engine='pyarrow', compression='snappy')
# Compare with CSV to see the difference
customers_df.to_csv('customers.csv', index=False)
orders_df.to_csv('orders.csv', index=False)

import os

def get_size_mb(filename):
    return os.path.getsize(filename) / (1024 * 1024)
print("Storage Comparison:")
print(f"customers.csv: {get_size_mb('customers.csv'):.2f} MB")
print(f"customers.parquet: {get_size_mb('customers.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('customers.parquet')/get_size_mb('customers.csv'))*100:.1f}%\n")
print(f"orders.csv: {get_size_mb('orders.csv'):.2f} MB")
print(f"orders.parquet: {get_size_mb('orders.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('orders.parquet')/get_size_mb('orders.csv'))*100:.1f}%")
Output:
Storage Comparison:
customers.csv: 0.73 MB
customers.parquet: 0.38 MB
Savings: 48.5%
orders.csv: 3.01 MB
orders.parquet: 1.25 MB
Savings: 58.5%
These compression ratios are typical. Parquet generally achieves much better compression than CSV. The codec we’re using here is Snappy, which prioritizes speed over maximum compression.
Note: Parquet supports other codecs like Gzip, which offers better compression but is slower, and Zstd, which strikes a good balance between compression ratio and speed.
# Querying Parquet Files with DuckDB
Now comes the interesting part. We can query these Parquet files directly using SQL without loading them into a database first.
import duckdb

# Create a DuckDB connection
con = duckdb.connect(database=':memory:')

# Query the Parquet file directly
query = """
SELECT
    customer_segment,
    COUNT(*) as num_customers,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage
FROM 'customers.parquet'
GROUP BY customer_segment
ORDER BY num_customers DESC
"""
result = con.execute(query).fetchdf()
print("Customer Distribution:")
print(result)
Output:
Customer Distribution:
  customer_segment  num_customers  percentage
0         Standard           5070       50.70
1            Basic           2887       28.87
2          Premium           2043       20.43
Look at the query syntax: FROM 'customers.parquet'. DuckDB reads the file directly. There is no import step, no CREATE TABLE statement, and no waiting for data to load. You write SQL, DuckDB figures out what data it needs from the file, and returns results.
In traditional workflows, you would need to create a database, define schemas, import data, create indexes, and then finally query. With DuckDB and Parquet, you skip all of that. Under the hood, DuckDB reads the Parquet file metadata to understand the schema, then uses predicate pushdown to skip reading data that doesn’t match your WHERE clause. It only reads the columns your query actually uses. For large files, this makes queries very fast.
# Performing Complex Analytics
Let’s run a slightly more complex analytical query. We’ll analyze monthly revenue trends broken down by customer segment.
query = """
SELECT
    strftime(o.order_date, '%Y-%m') as month,
    c.customer_segment,
    COUNT(DISTINCT o.order_id) as num_orders,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    ROUND(SUM(o.order_total), 2) as total_revenue,
    ROUND(AVG(o.order_total), 2) as avg_order_value
FROM 'orders.parquet' AS o
JOIN 'customers.parquet' AS c
    ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY month, c.customer_segment
ORDER BY month DESC, total_revenue DESC
LIMIT 15
"""
monthly_revenue = con.execute(query).fetchdf()
print("Recent Monthly Revenue by Segment:")
print(monthly_revenue.to_string(index=False))
Output:
Recent Monthly Revenue by Segment:
  month customer_segment  num_orders  unique_customers  total_revenue  avg_order_value
2026-01         Standard        2600              1468     1683223.68           647.39
2026-01            Basic        1585               857     1031126.44           650.55
2026-01          Premium         970               560      914105.61           942.38
2025-12         Standard        2254              1571     1533076.22           680.16
2025-12          Premium         885               613      921775.85          1041.55
2025-12            Basic        1297               876      889270.86           685.64
2025-11         Standard        1795              1359     1241006.08           691.37
2025-11          Premium         725               554      717625.75           989.83
2025-11            Basic        1012               767      682270.44           674.18
2025-10         Standard        1646              1296     1118400.61           679.47
2025-10          Premium         702               550      695913.24           991.33
2025-10            Basic         988               769      688428.86           696.79
2025-09         Standard        1446              1181      970017.17           670.83
2025-09          Premium         594               485      577486.81           972.20
2025-09            Basic         750               618      495726.69           660.97
This query groups by two dimensions (month and segment), aggregates multiple metrics, and filters on payment status. It’s the kind of query you’d write frequently in analytical work. The strftime function formats dates directly in SQL. The ROUND function cleans up decimal places. Multiple aggregations run efficiently and give the expected results.
# Joining Multiple Tables
Real analytics rarely involves a single table. Let’s join our tables to answer a business question: which product categories generate the most revenue, and how does this differ by customer segment?
query = """
SELECT
    p.category,
    c.customer_segment,
    COUNT(DISTINCT oi.order_id) as num_orders,
    SUM(oi.quantity) as units_sold,
    ROUND(SUM(oi.item_total), 2) as total_revenue,
    ROUND(AVG(oi.item_total), 2) as avg_item_value
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY p.category, c.customer_segment
ORDER BY total_revenue DESC
LIMIT 20
"""
category_analysis = con.execute(query).fetchdf()
print("Revenue by Category and Customer Segment:")
print(category_analysis.to_string(index=False))
Truncated output:
Revenue by Category and Customer Segment:
   category customer_segment  num_orders  units_sold  total_revenue  avg_item_value
Electronics         Standard        4729      6431.0     6638814.75         1299.18
Electronics          Premium        2597      3723.0     3816429.62         1292.39
Electronics            Basic        2685      3566.0     3585652.92         1240.28
 Automotive         Standard        4506      5926.0     3050679.12          633.18
     Sports         Standard        5049      6898.0     2745487.54          497.55
...
...
   Clothing          Premium        3028      4342.0      400704.25          114.55
   Clothing            Basic        3102      4285.0      400391.18          117.49
      Books         Standard        6196      8511.0      252357.39           36.74
This query joins four tables. DuckDB automatically determines the optimal join order and execution strategy. Notice how readable the SQL is compared to equivalent pandas code. For complex analytical logic, SQL often expresses intent more clearly than DataFrame operations.
# Understanding Query Performance
Let’s compare DuckDB with pandas for a typical analytical task.
# Method 1: Using pandas
import time

# Analytical task: calculate customer purchase patterns
print("Performance Comparison: Customer Purchase Analysis\n")

start_time = time.time()

# Merge dataframes
merged = order_items_df.merge(orders_df, on='order_id')
merged = merged.merge(products_df, on='product_id')

# Filter completed orders
completed = merged[merged['payment_status'] == 'completed']

# Group and aggregate
customer_patterns = completed.groupby('customer_id').agg({
    'order_id': 'nunique',
    'product_id': 'nunique',
    'item_total': ['sum', 'mean'],
    'category': lambda x: x.mode()[0] if len(x) > 0 else None
})
customer_patterns.columns = ['num_orders', 'unique_products', 'total_spent', 'avg_spent', 'favorite_category']
customer_patterns = customer_patterns.sort_values('total_spent', ascending=False).head(100)

pandas_time = time.time() - start_time
# Method 2: Using DuckDB
start_time = time.time()

query = """
SELECT
    o.customer_id,
    COUNT(DISTINCT oi.order_id) as num_orders,
    COUNT(DISTINCT oi.product_id) as unique_products,
    ROUND(SUM(oi.item_total), 2) as total_spent,
    ROUND(AVG(oi.item_total), 2) as avg_spent,
    MODE(p.category) as favorite_category
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
WHERE o.payment_status = 'completed'
GROUP BY o.customer_id
ORDER BY total_spent DESC
LIMIT 100
"""
duckdb_result = con.execute(query).fetchdf()
duckdb_time = time.time() - start_time

print(f"Pandas execution time: {pandas_time:.4f} seconds")
print(f"DuckDB execution time: {duckdb_time:.4f} seconds")
print(f"Speedup: {pandas_time/duckdb_time:.1f}x faster with DuckDB\n")
print("Top 5 customers by total spent:")
print(duckdb_result.head().to_string(index=False))
Output:
Performance Comparison: Customer Purchase Analysis

Pandas execution time: 1.9872 seconds
DuckDB execution time: 0.1171 seconds
Speedup: 17.0x faster with DuckDB

Top 5 customers by total spent:
 customer_id  num_orders  unique_products  total_spent  avg_spent favorite_category
        8747           8               24     21103.21     879.30       Electronics
         617           9               27     19596.22     725.79       Electronics
        2579           9               18     17011.30     895.33            Sports
        6242           7               23     16781.11     729.61       Electronics
        5443           8               22     16697.02     758.96        Automotive
DuckDB is about 17x faster here. This performance gap is even more pronounced with larger datasets. The pandas approach loads all the data into memory, performs multiple merge operations (which create copies), and then aggregates. DuckDB reads directly from the Parquet files, pushes filters down to avoid reading unnecessary data, and uses optimized join algorithms.
# Building Reusable Analytics Queries
In production analytics, you’ll run similar queries repeatedly with different parameters. Let’s build a reusable function that follows best practices for this workflow.
def analyze_product_performance(con, category=None, min_revenue=None, date_from=None, top_n=20):
    """
    Analyze product performance with flexible filtering.

    This demonstrates how to build reusable analytical queries that can be
    parameterized for different use cases. In production, you'd build a library
    of these functions for common analytical questions.
    """
    # Build the WHERE clause dynamically based on parameters
    where_clauses = ["o.payment_status = 'completed'"]
    if category:
        where_clauses.append(f"p.category = '{category}'")
    if date_from:
        where_clauses.append(f"o.order_date >= '{date_from}'")
    where_clause = " AND ".join(where_clauses)

    # Main analytical query
    query = f"""
    WITH product_metrics AS (
        SELECT
            p.product_id,
            p.product_name,
            p.category,
            p.base_price,
            COUNT(DISTINCT oi.order_id) as times_ordered,
            SUM(oi.quantity) as units_sold,
            ROUND(SUM(oi.item_total), 2) as total_revenue,
            ROUND(AVG(oi.unit_price), 2) as avg_selling_price,
            ROUND(SUM(oi.item_total) - (p.cost * SUM(oi.quantity)), 2) as profit
        FROM 'order_items.parquet' oi
        JOIN 'orders.parquet' o ON oi.order_id = o.order_id
        JOIN 'products.parquet' p ON oi.product_id = p.product_id
        WHERE {where_clause}
        GROUP BY p.product_id, p.product_name, p.category, p.base_price, p.cost
    )
    SELECT
        *,
        ROUND(100.0 * profit / total_revenue, 2) as profit_margin_pct,
        ROUND(avg_selling_price / base_price, 2) as price_realization
    FROM product_metrics
    """

    # Add revenue filter if specified
    if min_revenue:
        query += f" WHERE total_revenue >= {min_revenue}"

    query += f"""
    ORDER BY total_revenue DESC
    LIMIT {top_n}
    """

    return con.execute(query).fetchdf()
This function does a few things. First, it builds SQL dynamically based on parameters, allowing flexible filtering without writing separate queries for each case. Second, it uses a Common Table Expression (CTE) to organize complex logic into readable steps. Third, it calculates derived metrics like profit margin and price realization that require multiple source columns.
The profit calculation subtracts costs from revenue using data from both the order items and products tables. This kind of cross-table calculation is straightforward in SQL but would be cumbersome with multiple pandas operations. DuckDB handles it efficiently in a single query.
Here is an example that uses the function above:
# Example 1: Top electronics products
electronics = analyze_product_performance(con, category='Electronics', top_n=10)
print("Top 10 Electronics Products:")
print(electronics[['product_name', 'units_sold', 'total_revenue', 'profit_margin_pct']].to_string(index=False))
Output:
Top 10 Electronics Products:
           product_name  units_sold  total_revenue  profit_margin_pct
Electronics Product 113       262.0      510331.81              38.57
Electronics Product 154       289.0      486307.74              38.28
Electronics Product 122       229.0      448680.64              38.88
Electronics Product 472       251.0      444680.20              38.51
Electronics Product 368       222.0      424057.14              38.96
Electronics Product 241       219.0      407648.10              38.75
Electronics Product 410       243.0      400078.65              38.31
Electronics Product 104       233.0      400036.84              38.73
  Electronics Product 2       213.0      382583.85              38.76
Electronics Product 341       240.0      376722.94              38.94
And here is another example:
# Example 2: High-revenue products across all categories
print("\n\nHigh-Revenue Products (>$50k revenue):")
high_revenue = analyze_product_performance(con, min_revenue=50000, top_n=10)
print(high_revenue[['product_name', 'category', 'total_revenue', 'profit']].to_string(index=False))
Output:
High-Revenue Products (>$50k revenue):
           product_name    category  total_revenue     profit
Electronics Product 113 Electronics      510331.81  196846.19
Electronics Product 154 Electronics      486307.74  186140.78
Electronics Product 122 Electronics      448680.64  174439.40
Electronics Product 472 Electronics      444680.20  171240.80
Electronics Product 368 Electronics      424057.14  165194.04
Electronics Product 241 Electronics      407648.10  157955.25
Electronics Product 410 Electronics      400078.65  153270.84
Electronics Product 104 Electronics      400036.84  154953.46
  Electronics Product 2 Electronics      382583.85  148305.15
Electronics Product 341 Electronics      376722.94  146682.94
# Wrapping Up
In this article, we analyzed e-commerce data. We generated relational data, saved it as Parquet, and queried it with DuckDB. The performance comparisons showed substantial speedups over traditional pandas approaches.
Use this stack when you are performing analytical workloads on structured data. If you are aggregating, filtering, joining, and computing metrics, this is useful. It works well for data that changes in batches rather than continuously. If you are analyzing yesterday’s sales, processing monthly reports, or exploring historical trends, Parquet files updated periodically work great. You don’t need a live database constantly accepting writes.
However, this stack is not suitable for everything:
- If you need real-time updates with many concurrent writers, you need a traditional database with ACID transactions
- If you are building an application with user-facing queries that need millisecond response times, an indexed database is better
- If multiple users need to query concurrently with different access permissions, a database server provides better control
The sweet spot is analytical work on large datasets, where data updates happen in batches and you need fast, flexible querying and analysis.
Happy analyzing!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



