# Introduction
Data analytics has changed in recent years. The traditional approach of loading everything into a relational database and running SQL queries still works, but it’s often overkill for many analytical workloads. Storing data in Parquet files and querying them directly with DuckDB is frequently faster, simpler, and easier to maintain.
In this article, I’ll show you how to build a data analytics stack in Python that uses DuckDB to query data stored in Parquet files. We’ll work with a sample dataset, explore how each component works, and understand why this approach can be useful in your data science projects.
You can find the code on GitHub.
# Prerequisites
Before we start, make sure you have:
- Python 3.10 or a later version installed
- An understanding of SQL fundamentals and pandas DataFrame operations
- Familiarity with data analysis concepts
Also install the required libraries:
pip install duckdb pandas pyarrow numpy faker
# Understanding the Recommended Data Analytics Stack
Let’s start by understanding what each component does and why they work well together.
Parquet is a columnar storage format originally created for the Hadoop ecosystem. Unlike row-based formats like CSV, where each line is a complete record, Parquet organizes data by columns. This might seem like a small difference, but it has big implications for analytics.
When you run a query that only needs three columns from a table with fifty columns, Parquet lets you read only those three columns. With CSV, you have to read every row completely and then throw away the 47 columns you don’t need. This makes Parquet much faster for typical analytical queries. Additionally, columnar storage compresses well because values in the same column tend to be similar.
DuckDB is an embedded analytical database. While SQLite is optimized for transactional workloads with many small reads and writes, DuckDB is designed specifically for analytical queries that involve scanning large amounts of data, aggregations, and joins. The embedded part means it runs inside your Python process, so there is no separate database server to install or manage.
What makes DuckDB special for analytics is that it can query Parquet files directly. You don’t need to import data into the database first. Point DuckDB at a Parquet file, write SQL, and it reads only what it needs. This “query in place” capability is what makes the whole stack useful.
You can use this in your Python development environment. You store data in Parquet files, pandas handles data manipulation, DuckDB executes analytical queries, and the entire Python ecosystem is available for visualization, machine learning, and automation.
# Creating a Sample Dataset
We’ll use an e-commerce dataset. You can use the data_generator.py script to generate the sample dataset or follow along in this notebook.
The dataset consists of customers who place orders, orders that contain multiple items, and products with categories and pricing.
The data has referential integrity. Every order references a valid customer, and every order item references both a valid order and a valid product. This lets us perform meaningful joins and aggregations.
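The data_generator.py script itself isn’t reproduced here, but if you want a self-contained way to follow along, the sketch below builds four DataFrames with the same shape the later queries assume (the column names, segment labels, and row counts are my stand-ins, not the script’s exact output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers, n_products, n_orders = 1000, 100, 5000

customers_df = pd.DataFrame({
    "customer_id": np.arange(1, n_customers + 1),
    "customer_segment": rng.choice(
        ["Standard", "Basic", "Premium"], size=n_customers, p=[0.5, 0.3, 0.2]),
})

categories = ["Electronics", "Clothing", "Books", "Sports", "Automotive"]
base_price = rng.uniform(5, 2000, n_products).round(2)
products_df = pd.DataFrame({
    "product_id": np.arange(1, n_products + 1),
    "product_name": [f"Product {i}" for i in range(1, n_products + 1)],
    "category": rng.choice(categories, size=n_products),
    "base_price": base_price,
    "cost": (base_price * 0.6).round(2),
})

orders_df = pd.DataFrame({
    "order_id": np.arange(1, n_orders + 1),
    "customer_id": rng.integers(1, n_customers + 1, n_orders),
    "order_date": pd.to_datetime("2025-01-01")
        + pd.to_timedelta(rng.integers(0, 365, n_orders), unit="D"),
    "payment_status": rng.choice(
        ["completed", "pending", "refunded"], size=n_orders, p=[0.85, 0.1, 0.05]),
})

# Every order gets 1-4 line items referencing valid orders and products,
# which is what gives the dataset its referential integrity
item_orders = orders_df["order_id"].repeat(
    rng.integers(1, 5, n_orders)).reset_index(drop=True)
order_items_df = pd.DataFrame({
    "order_id": item_orders,
    "product_id": rng.integers(1, n_products + 1, len(item_orders)),
    "quantity": rng.integers(1, 4, len(item_orders)),
})
order_items_df = order_items_df.merge(
    products_df[["product_id", "base_price"]], on="product_id")
order_items_df["unit_price"] = order_items_df["base_price"]
order_items_df["item_total"] = (
    order_items_df["unit_price"] * order_items_df["quantity"]).round(2)
order_items_df = order_items_df.drop(columns="base_price")

# Derive order totals from the line items so the tables stay consistent
orders_df["order_total"] = (
    order_items_df.groupby("order_id")["item_total"].sum()
    .reindex(orders_df["order_id"]).to_numpy())
```

The exact prices and counts don’t matter; what matters is that every foreign key resolves, so the joins later in the article behave the same way.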
# Saving Data to Parquet Files
Before we save our data, let’s understand why Parquet is effective for analytics. We’ve already discussed the advantages of columnar storage formats like Parquet, but let’s go over it again, this time in greater detail.
In a CSV file, data is stored row by row. If you have a million rows with 50 columns each, and you want to analyze just one column, you still have to read all 50 million values to skip past the columns you don’t need. That is wasteful.
Parquet, as we now know, stores data column by column. All values for one column are stored together. When you query one column, you read exactly that column and nothing else. For analytical queries that typically touch a small number of columns, this is much faster.
Columnar storage also compresses better. Values in the same column tend to be similar: they are usually all integers, all dates, or all from the same categorical set. Compression algorithms work much better on similar data than on random data.
Let’s save our data as Parquet and see the benefits:
# Save tables as Parquet files
customers_df.to_parquet('customers.parquet', engine='pyarrow', compression='snappy')
products_df.to_parquet('products.parquet', engine='pyarrow', compression='snappy')
orders_df.to_parquet('orders.parquet', engine='pyarrow', compression='snappy')
order_items_df.to_parquet('order_items.parquet', engine='pyarrow', compression='snappy')
# Compare with CSV to see the difference
customers_df.to_csv('customers.csv', index=False)
orders_df.to_csv('orders.csv', index=False)

import os

def get_size_mb(filename):
    return os.path.getsize(filename) / (1024 * 1024)
print("Storage Comparison:")
print(f"customers.csv: {get_size_mb('customers.csv'):.2f} MB")
print(f"customers.parquet: {get_size_mb('customers.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('customers.parquet')/get_size_mb('customers.csv'))*100:.1f}%\n")
print(f"orders.csv: {get_size_mb('orders.csv'):.2f} MB")
print(f"orders.parquet: {get_size_mb('orders.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('orders.parquet')/get_size_mb('orders.csv'))*100:.1f}%")
Output:
Storage Comparison:
customers.csv: 0.73 MB
customers.parquet: 0.38 MB
Savings: 48.5%
orders.csv: 3.01 MB
orders.parquet: 1.25 MB
Savings: 58.5%
These compression ratios are typical. Parquet generally achieves much better compression than CSV. The codec we’re using here is Snappy, which prioritizes speed over maximum compression.
Note: Parquet supports other codecs like Gzip, which offers better compression but is slower, and Zstd, which strikes a good balance between compression ratio and speed.
# Querying Parquet Files with DuckDB
Now comes the interesting part. We can query these Parquet files directly using SQL without loading them into a database first.
import duckdb

# Create a DuckDB connection
con = duckdb.connect(database=':memory:')

# Query the Parquet file directly
query = """
SELECT
    customer_segment,
    COUNT(*) as num_customers,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage
FROM 'customers.parquet'
GROUP BY customer_segment
ORDER BY num_customers DESC
"""
result = con.execute(query).fetchdf()
print("Customer Distribution:")
print(result)
Output:
Customer Distribution:
  customer_segment  num_customers  percentage
0         Standard           5070       50.70
1            Basic           2887       28.87
2          Premium           2043       20.43
Look at the query syntax: FROM 'customers.parquet'. DuckDB reads the file directly. There is no import step, no CREATE TABLE statement, and no waiting for data to load. You write SQL, DuckDB figures out what data it needs from the file, and returns results.
In traditional workflows, you would need to create a database, define schemas, import data, create indexes, and then finally query. With DuckDB and Parquet, you skip all of that. Under the hood, DuckDB reads the Parquet file metadata to understand the schema, then uses predicate pushdown to skip reading data that doesn’t match your WHERE clause. It only reads the columns your query actually uses. For large files, this makes queries very fast.
# Performing Complex Analytics
Let’s run a slightly more complex analytical query. We’ll analyze monthly revenue trends broken down by customer segment.
query = """
SELECT
    strftime(o.order_date, '%Y-%m') as month,
    c.customer_segment,
    COUNT(DISTINCT o.order_id) as num_orders,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    ROUND(SUM(o.order_total), 2) as total_revenue,
    ROUND(AVG(o.order_total), 2) as avg_order_value
FROM 'orders.parquet' AS o
JOIN 'customers.parquet' AS c
    ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY month, c.customer_segment
ORDER BY month DESC, total_revenue DESC
LIMIT 15
"""
monthly_revenue = con.execute(query).fetchdf()
print("Recent Monthly Revenue by Segment:")
print(monthly_revenue.to_string(index=False))
Output:
Recent Monthly Revenue by Segment:
  month customer_segment  num_orders  unique_customers  total_revenue  avg_order_value
2026-01         Standard        2600              1468     1683223.68           647.39
2026-01            Basic        1585               857     1031126.44           650.55
2026-01          Premium         970               560      914105.61           942.38
2025-12         Standard        2254              1571     1533076.22           680.16
2025-12          Premium         885               613      921775.85          1041.55
2025-12            Basic        1297               876      889270.86           685.64
2025-11         Standard        1795              1359     1241006.08           691.37
2025-11          Premium         725               554      717625.75           989.83
2025-11            Basic        1012               767      682270.44           674.18
2025-10         Standard        1646              1296     1118400.61           679.47
2025-10          Premium         702               550      695913.24           991.33
2025-10            Basic         988               769      688428.86           696.79
2025-09         Standard        1446              1181      970017.17           670.83
2025-09          Premium         594               485      577486.81           972.20
2025-09            Basic         750               618      495726.69           660.97
This query groups by two dimensions (month and segment), aggregates multiple metrics, and filters on payment status. It’s the kind of query you’d write frequently in analytical work. The strftime function formats dates directly in SQL. The ROUND function cleans up decimal places. Multiple aggregations run efficiently and give the expected results.
# Joining Multiple Tables
Real analytics rarely involves a single table. Let’s join our tables to answer a business question: which product categories generate the most revenue, and how does this differ by customer segment?
query = """
SELECT
    p.category,
    c.customer_segment,
    COUNT(DISTINCT oi.order_id) as num_orders,
    SUM(oi.quantity) as units_sold,
    ROUND(SUM(oi.item_total), 2) as total_revenue,
    ROUND(AVG(oi.item_total), 2) as avg_item_value
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY p.category, c.customer_segment
ORDER BY total_revenue DESC
LIMIT 20
"""
category_analysis = con.execute(query).fetchdf()
print("Revenue by Category and Customer Segment:")
print(category_analysis.to_string(index=False))
Truncated output:
Revenue by Category and Customer Segment:
   category customer_segment  num_orders  units_sold  total_revenue  avg_item_value
Electronics         Standard        4729      6431.0     6638814.75         1299.18
Electronics          Premium        2597      3723.0     3816429.62         1292.39
Electronics            Basic        2685      3566.0     3585652.92         1240.28
 Automotive         Standard        4506      5926.0     3050679.12          633.18
     Sports         Standard        5049      6898.0     2745487.54          497.55
...
...
   Clothing          Premium        3028      4342.0      400704.25          114.55
   Clothing            Basic        3102      4285.0      400391.18          117.49
      Books         Standard        6196      8511.0      252357.39           36.74
This query joins four tables. DuckDB automatically determines the optimal join order and execution strategy. Notice how readable the SQL is compared to equivalent pandas code. For complex analytical logic, SQL often expresses intent more clearly than DataFrame operations.
# Understanding Query Performance
Let’s compare DuckDB with pandas for a typical analytical task.
# Method 1: Using pandas
import time

# Analytical task: calculate customer purchase patterns
print("Performance Comparison: Customer Purchase Analysis\n")

start_time = time.time()

# Merge dataframes
merged = order_items_df.merge(orders_df, on='order_id')
merged = merged.merge(products_df, on='product_id')

# Filter completed orders
completed = merged[merged['payment_status'] == 'completed']

# Group and aggregate
customer_patterns = completed.groupby('customer_id').agg({
    'order_id': 'nunique',
    'product_id': 'nunique',
    'item_total': ['sum', 'mean'],
    'category': lambda x: x.mode()[0] if len(x) > 0 else None
})
customer_patterns.columns = ['num_orders', 'unique_products', 'total_spent', 'avg_spent', 'favorite_category']
customer_patterns = customer_patterns.sort_values('total_spent', ascending=False).head(100)

pandas_time = time.time() - start_time
# Method 2: Using DuckDB
start_time = time.time()

query = """
SELECT
    o.customer_id,
    COUNT(DISTINCT oi.order_id) as num_orders,
    COUNT(DISTINCT oi.product_id) as unique_products,
    ROUND(SUM(oi.item_total), 2) as total_spent,
    ROUND(AVG(oi.item_total), 2) as avg_spent,
    MODE(p.category) as favorite_category
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
WHERE o.payment_status = 'completed'
GROUP BY o.customer_id
ORDER BY total_spent DESC
LIMIT 100
"""
duckdb_result = con.execute(query).fetchdf()
duckdb_time = time.time() - start_time

print(f"Pandas execution time: {pandas_time:.4f} seconds")
print(f"DuckDB execution time: {duckdb_time:.4f} seconds")
print(f"Speedup: {pandas_time/duckdb_time:.1f}x faster with DuckDB\n")
print("Top 5 customers by total spent:")
print(duckdb_result.head().to_string(index=False))
Output:
Performance Comparison: Customer Purchase Analysis

Pandas execution time: 1.9872 seconds
DuckDB execution time: 0.1171 seconds
Speedup: 17.0x faster with DuckDB

Top 5 customers by total spent:
 customer_id  num_orders  unique_products  total_spent  avg_spent favorite_category
        8747           8               24     21103.21     879.30       Electronics
         617           9               27     19596.22     725.79       Electronics
        2579           9               18     17011.30     895.33            Sports
        6242           7               23     16781.11     729.61       Electronics
        5443           8               22     16697.02     758.96        Automotive
DuckDB is about 17x faster here. This performance gap is even more pronounced with larger datasets. The pandas approach loads all the data into memory, performs multiple merge operations (which create copies), and then aggregates. DuckDB reads directly from the Parquet files, pushes filters down to avoid reading unnecessary data, and uses optimized join algorithms.
# Building Reusable Analytics Queries
In production analytics, you’ll run similar queries repeatedly with different parameters. Let’s build a reusable function that follows best practices for this workflow.
def analyze_product_performance(con, category=None, min_revenue=None, date_from=None, top_n=20):
    """
    Analyze product performance with flexible filtering.

    This demonstrates how to build reusable analytical queries that can be
    parameterized for different use cases. In production, you'd build a library
    of these functions for common analytical questions.
    """
    # Build the WHERE clause dynamically based on parameters
    where_clauses = ["o.payment_status = 'completed'"]
    if category:
        where_clauses.append(f"p.category = '{category}'")
    if date_from:
        where_clauses.append(f"o.order_date >= '{date_from}'")
    where_clause = " AND ".join(where_clauses)

    # Main analytical query
    query = f"""
    WITH product_metrics AS (
        SELECT
            p.product_id,
            p.product_name,
            p.category,
            p.base_price,
            COUNT(DISTINCT oi.order_id) as times_ordered,
            SUM(oi.quantity) as units_sold,
            ROUND(SUM(oi.item_total), 2) as total_revenue,
            ROUND(AVG(oi.unit_price), 2) as avg_selling_price,
            ROUND(SUM(oi.item_total) - (p.cost * SUM(oi.quantity)), 2) as profit
        FROM 'order_items.parquet' oi
        JOIN 'orders.parquet' o ON oi.order_id = o.order_id
        JOIN 'products.parquet' p ON oi.product_id = p.product_id
        WHERE {where_clause}
        GROUP BY p.product_id, p.product_name, p.category, p.base_price, p.cost
    )
    SELECT
        *,
        ROUND(100.0 * profit / total_revenue, 2) as profit_margin_pct,
        ROUND(avg_selling_price / base_price, 2) as price_realization
    FROM product_metrics
    """

    # Add revenue filter if specified
    if min_revenue:
        query += f" WHERE total_revenue >= {min_revenue}"

    query += f"""
    ORDER BY total_revenue DESC
    LIMIT {top_n}
    """

    return con.execute(query).fetchdf()
This function does a few things. First, it builds SQL dynamically based on parameters, allowing flexible filtering without writing separate queries for each case. Second, it uses a Common Table Expression (CTE) to organize complex logic into readable steps. Third, it calculates derived metrics like profit margin and price realization that require multiple source columns.
The profit calculation subtracts costs from revenue using data from both the order items and products tables. This kind of cross-table calculation is straightforward in SQL but would be cumbersome with multiple pandas operations. DuckDB handles it efficiently in a single query.
Here is an example that uses the function above:
# Example 1: Top electronics products
electronics = analyze_product_performance(con, category='Electronics', top_n=10)
print("Top 10 Electronics Products:")
print(electronics[['product_name', 'units_sold', 'total_revenue', 'profit_margin_pct']].to_string(index=False))
Output:
Top 10 Electronics Products:
           product_name  units_sold  total_revenue  profit_margin_pct
Electronics Product 113       262.0      510331.81              38.57
Electronics Product 154       289.0      486307.74              38.28
Electronics Product 122       229.0      448680.64              38.88
Electronics Product 472       251.0      444680.20              38.51
Electronics Product 368       222.0      424057.14              38.96
Electronics Product 241       219.0      407648.10              38.75
Electronics Product 410       243.0      400078.65              38.31
Electronics Product 104       233.0      400036.84              38.73
  Electronics Product 2       213.0      382583.85              38.76
Electronics Product 341       240.0      376722.94              38.94
And here is another example:
# Example 2: High-revenue products across all categories
print("\n\nHigh-Revenue Products (>$50k revenue):")
high_revenue = analyze_product_performance(con, min_revenue=50000, top_n=10)
print(high_revenue[['product_name', 'category', 'total_revenue', 'profit']].to_string(index=False))
Output:
High-Revenue Products (>$50k revenue):
           product_name    category  total_revenue     profit
Electronics Product 113 Electronics      510331.81  196846.19
Electronics Product 154 Electronics      486307.74  186140.78
Electronics Product 122 Electronics      448680.64  174439.40
Electronics Product 472 Electronics      444680.20  171240.80
Electronics Product 368 Electronics      424057.14  165194.04
Electronics Product 241 Electronics      407648.10  157955.25
Electronics Product 410 Electronics      400078.65  153270.84
Electronics Product 104 Electronics      400036.84  154953.46
  Electronics Product 2 Electronics      382583.85  148305.15
Electronics Product 341 Electronics      376722.94  146682.94
# Wrapping Up
In this article, we analyzed e-commerce data. We generated relational data, saved it as Parquet, and queried it with DuckDB. The performance comparisons showed substantial speedups over traditional pandas approaches.
Use this stack when you are performing analytical workloads on structured data. If you are aggregating, filtering, joining, and computing metrics, this is useful. It works well for data that changes in batches rather than continuously. If you are analyzing yesterday’s sales, processing monthly reports, or exploring historical trends, Parquet files updated periodically work great. You don’t need a live database constantly accepting writes.
However, this stack is not suitable for everything:
- If you need real-time updates with many concurrent writers, you need a traditional database with ACID transactions
- If you are building an application with user-facing queries that need millisecond response times, an indexed database is better
- If multiple users need to query concurrently with different access permissions, a database server provides better control
The sweet spot is analytical work on large datasets, where data updates happen in batches and you need fast, flexible querying and analysis.
Happy analyzing!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



