# Introduction
For the past ten years, Pandas has served as the go-to tool for Python-based data tasks. When working with in-memory datasets, its speed and familiarity are so well-established that most developers never consider exploring alternatives.
But as soon as you begin handling millions of rows, its limitations become obvious: groupby operations that drag on for multiple seconds, intermediate copies that eat up memory, and window functions that advance one window at a time under a Python-level loop rather than in a single vectorized pass.
Polars is a DataFrame library written in Rust and built on Apache Arrow. It treats parallel processing and lazy evaluation as core design principles. Unlike Pandas, which runs each operation immediately and sequentially, Polars can construct a query execution plan, apply optimizations before running it, and automatically execute most operations in parallel across all available CPU cores.
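A minimal sketch shows the difference in practice (the file and column names here are hypothetical): a lazy query only describes the work, and nothing executes until .collect().

```python
import polars as pl

# lazy: scan_csv records the source without reading any rows yet (hypothetical file)
plan = (
    pl.scan_csv("events.csv")
    .filter(pl.col("country") == "US")           # the optimizer can push this toward the scan
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)

result = plan.collect()  # the optimized plan runs here, in parallel across cores
```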
Throughout this guide, we tackle three practical data challenges drawn from real questions on the StrataScratch coding platform. For each challenge, we present both libraries’ approaches and highlight where the performance gap truly makes a difference.

# Using rank() vs. with_row_count(): Activity Rank
Here, we need to calculate the email activity ranking for each user based on how many emails they have sent. The top email sender receives rank 1. The output should be ordered by total emails in descending order, with alphabetical ordering used as a tiebreaker, and every rank must remain unique — even when two users share the exact same email count.
// Data Overview
The google_gmail_emails table contains one entry per sent email, recording the sender (from_user), the recipient (to_user), and the day it was sent (day). Below is a sample of the table contents:
| id | from_user | to_user | day |
|---|---|---|---|
| 0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
| 1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
| 2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
| 3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
| 4 | … | … | … |
| 314 | e6088004caf0c8cc51 | e6088004caf0c8cc51 | 5 |
Grain (what each output row represents): a single user paired with their total email count and a distinct activity rank.
// Frequent Pitfall
The problem requires assigning a unique rank even when two users have identical email counts. Many people mistakenly reach for rank(method='dense') in Pandas, which gives tied users the same rank. The right choice is method='first', which resolves ties by row position within the data. Because the grouped result is already ordered alphabetically by user_id when the rank is applied, the outcome is both unique and fully deterministic.
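A three-row toy series (hypothetical counts, with two users tied at 19 emails) makes the difference concrete:

```python
import pandas as pd

counts = pd.Series([19, 19, 18], index=["user_a", "user_b", "user_c"])

# 'dense' collapses the tie into a shared rank
print(counts.rank(method="dense", ascending=False).tolist())  # [1.0, 1.0, 2.0]

# 'first' breaks the tie by row position, so every rank is unique
print(counts.rank(method="first", ascending=False).tolist())  # [1.0, 2.0, 3.0]
```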
The ideal Polars approach sidesteps the rank function entirely. Once the data is sorted by ["total_emails", "user_id"] — descending and ascending, respectively — the .with_row_count("activity_rank", offset=1) call simply generates sequential integers starting from 1. No extra tie-breaking logic is needed since the sorting step already accounts for it.
// Solutions
1. Pandas Approach
We rename from_user to user_id, aggregate by user to count emails, calculate the first rank, then sort by email count in descending order with alphabetical tie-breaking.
```python
import pandas as pd
import numpy as np

google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
result = (
    google_gmail_emails.groupby(["user_id"])
    .size()
    .to_frame("total_emails")
    .reset_index()
)
result["activity_rank"] = result["total_emails"].rank(method="first", ascending=False)
result = result.sort_values(by=["total_emails", "user_id"], ascending=[False, True])
```
2. Polars Approach
We build a lazy processing chain that renames columns, groups the data, sorts, and assigns row numbers in a single streamlined flow. Triggering .collect() at the end executes and returns the final result.
```python
import polars as pl

google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
result = (
    google_gmail_emails.lazy()
    .group_by("user_id")
    .agg(total_emails=pl.count())
    .sort(
        by=["total_emails", "user_id"],
        descending=[True, False],
    )
    .with_row_count("activity_rank", offset=1)
    .select([
        pl.col("user_id"),
        "total_emails",
        "activity_rank",
    ])
    .collect()
)
```
// Performance Comparison

The Pandas approach scans the data twice after grouping: once to compute sizes and again to apply ranks. Internally, rank(method='first') allocates a rank array, resolves ties via argsort, and writes values back — a process that carries more overhead than one might expect for a single column. The Polars group_by distributes the computation across all CPU cores, leading to substantially faster aggregation on large datasets. And since the data is already sorted, .with_row_count() needs only a single sequential O(n) pass, replacing the rank function with the cheapest operation possible. On a table with millions of email records, parallel aggregation without a rank function can cut wall-clock time by 5–10x compared to the Pandas approach.
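That figure will vary by machine and core count; if you want to check it yourself, a rough timing harness along these lines (with synthetic data standing in for the real table) is enough to see the trend.

```python
import time
import numpy as np
import pandas as pd
import polars as pl

# synthetic stand-in: ~5M emails spread across 10k senders (assumed sizes)
senders = np.random.randint(0, 10_000, size=5_000_000).astype(str)
pdf = pd.DataFrame({"user_id": senders})
plf = pl.DataFrame({"user_id": senders})

t0 = time.perf_counter()
r = pdf.groupby("user_id").size().to_frame("total_emails").reset_index()
r["activity_rank"] = r["total_emails"].rank(method="first", ascending=False)
r = r.sort_values(["total_emails", "user_id"], ascending=[False, True])
t1 = time.perf_counter()

q = (
    plf.lazy()
    .group_by("user_id")
    .agg(total_emails=pl.count())
    .sort(["total_emails", "user_id"], descending=[True, False])
    .with_row_count("activity_rank", offset=1)
    .collect()
)
t2 = time.perf_counter()

print(f"pandas: {t1 - t0:.3f}s   polars: {t2 - t1:.3f}s")
```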
Here is a preview of the solution's output:
| user_id | total_emails | activity_rank |
|---|---|---|
| 32ded68d89443e808 | 19 | 1 |
| ef5fe98c6b9f313075 | 19 | 2 |
| 5b8754928306a18b68 | 18 | 3 |
| 55e60cfcc9dc49c17e | 16 | 4 |
| 91f59516cb9dee1e88 | 16 | 5 |
| … | … | … |
| e6088004caf0c8cc51 | 6 | 25 |
# cumcount() + pivot() vs. over(): Identifying Repeat Buyers
This question asks us to find returning active users — those who made a second purchase between 1 and 7 days after their first. Purchases occurring on the same day should be excluded. The output is simply a list of qualifying user_id values.
// Data Overview
The amazon_transactions table contains one row per purchase, including user_id, item, created_at date, and revenue.
Here is a sample of the table:
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| … | … | … | … | … |
| 100 | 117 | bread | 2020-03-10 | 209 |
Grain (what each output row represents): a single user ID that made a qualifying return purchase within 7 days of their first.
// Edge Case
Same-day purchases must be excluded, so the interval between the first and second purchase must be greater than 0 days and no more than 7 days. A customer who buys twice on the same day does not qualify.
// Solutions
Both approaches determine each user’s earliest purchase date and then filter for subsequent purchases within the 1- to 7-day window. One important note: if created_at includes timestamps rather than plain dates, you must truncate to the date before comparing. Otherwise, two purchases made at different times on the same day would incorrectly satisfy the strict inequality.
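A two-timestamp sketch (the times are hypothetical) shows why the truncation matters: with raw timestamps, a same-day pair passes the "more than 0, at most 7 days" check.

```python
import pandas as pd

# hypothetical same-day purchases at different times
first = pd.Timestamp("2020-03-03 09:15:00")
second = pd.Timestamp("2020-03-03 18:40:00")

# raw timestamps: the gap is positive and under 7 days, so it would (wrongly) qualify
print(pd.Timedelta(0) < (second - first) <= pd.Timedelta(days=7))   # True

# truncated to calendar dates: the gap is 0 days and is correctly excluded
print((second.normalize() - first.normalize()).days)                # 0
```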
1. Pandas Solution
In Pandas, the approach involves extracting unique purchase dates per user, ranking them with cumcount(), pivoting to align first and second dates side by side, and calculating the day difference.
```python
import pandas as pd

amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
first_two = (
    ranked[ranked["rn"] <= 2]
    .pivot(index="user_id", columns="rn", values="purchase_date")
    .reset_index()
    .rename(columns={1: "first_date", 2: "second_date"})
)
first_two = first_two.dropna(subset=["second_date"])
first_two["diff"] = (
    pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])
).dt.days
result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]
```
2. Polars Solution
The Polars approach computes the first purchase date per user as a window expression using .over("user_id"), filters for purchases within the time window, and returns a deduplicated user_id list.
```python
import polars as pl

# returning active users: 2nd purchase 1–7 days after the first (ignore same-day)
returning_users = (
    amazon_transactions
    .lazy()
    # truncate to calendar dates, then take the first purchase date per user
    # (a window expression avoids a separate group_by + join)
    .with_columns(pl.col("created_at").cast(pl.Date))
    .with_columns(
        pl.col("created_at").min().over("user_id").alias("first_purchase_date")
    )
    # keep purchases made 1–7 whole days after the first, then deduplicate the users
    .filter((pl.col("created_at") - pl.col("first_purchase_date")).dt.total_days().is_between(1, 7))
    .select(pl.col("user_id").unique().sort())
    .collect()
)
```
// Code Explanation
This code filters transactions occurring within 1 to 7 days after each user's first purchase. It then identifies unique users and sorts them in ascending order.
// Performance Comparison

Notice the number of distinct DataFrame allocations in the Pandas approach: the deduplicated daily table, the sorted ranked table, the pivoted frame, the dropna result, and the filtered output. That is five separate objects, each copying data into a new block of memory. On a large transactions table, the pivot step alone can noticeably inflate memory usage, since it reshapes the data into a wide format.
The Polars lazy chain avoids allocating memory until .collect(). The .over("user_id") window expression computes each user's earliest purchase date in a single pass, the .filter() applies right within the same step, and .unique() runs concurrently across CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step — Polars handles date arithmetic natively inside its expression engine. This approach uses less memory and executes faster, even on moderately sized datasets.
Here is a preview of the code output:
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| … |
| 143 |
# expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average
The task here is to calculate a cumulative average for monthly book sales throughout 2022. The average grows each month by including all previous months: February averages January and February, March averages all three months, and so forth. The output should contain each month, that month's total sales, and the cumulative average rounded to the nearest whole number.
// Data Overview
The amazon_books table holds one row per book with its unit price. The book_orders table contains one row per order, connecting a book ID to a quantity and an order date. Below is a preview of the table:
| book_id | book_title | unit_price |
|---|---|---|
| B001 | The Hunger Games | 25 |
| B002 | The Outsiders | 50 |
| B003 | To Kill a Mockingbird | 100 |
| … | … | … |
| B020 | The Pillars of the Earth | 60 |
The book_orders table has one row per book order, mapping each order ID to an order date, book ID, and the quantity ordered:
| order_id | order_date | book_id | quantity |
|---|---|---|---|
| 1001 | 2022-01-10 | B001 | 1 |
| 1002 | 2022-01-10 | B009 | 1 |
| 1003 | 2022-01-15 | B012 | 2 |
| … | … | … | … |
| 1084 | 2023-02-01 | B009 | 1 |
Grain (what each output row represents): one month in 2022, showing total sales for that month along with a cumulative average of all monthly sales up to and including that month.
// Trade-Offs
With Pandas, .expanding().mean() is simple to use, but it runs a Python-level loop over expanding window slices internally. For a 12-row monthly summary, this overhead is minor. However, with daily or hourly data at scale (e.g., three years of hourly transactions), each expanding window slice adds cumulative cost that grows row by row.
Polars' cum_mean() completes a single pass in Rust and is inherently faster at scale. One caveat: the question requires rounding to the closest whole number, and Pandas applies banker's rounding (round half to even) by default. The Polars approach uses NumPy's cumsum with an explicit floor(x + 0.5) formula to enforce round-half-up behavior. If you need an exact match to the expected output, the NumPy strategy is more reliable than the built-in rounding in either library.
// Solutions
1. Pandas Approach
We combine books with orders, keep only 2022 records, aggregate monthly sales, then use .expanding().mean() to derive the cumulative average.
```python
import pandas as pd
import numpy as np
import datetime as dt

merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
merged["order_date"] = pd.to_datetime(merged["order_date"])
merged["order_month"] = merged["order_date"].dt.month
merged["year"] = merged["order_date"].dt.year
merged["sales"] = merged["unit_price"] * merged["quantity"]
merged = merged.loc[(merged["year"] == 2022), :]
result = (
    merged.groupby("order_month")["sales"]
    .sum()
    .to_frame("monthly_sales")
    .sort_values(by="order_month")
    .reset_index()
)
result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0)
result
```
2. Polars: Assembling the Lazy Pipeline and Materializing
We merge the two tables within a lazy pipeline, calculate sales as unit_price * quantity, restrict results to 2022, summarize by month, and invoke .collect() to transition back to eager mode prior to the NumPy rolling computation.
```python
import polars as pl
import numpy as np

# Step 1: Build monthly sales aggregation (LazyFrame)
monthly_sales_lazy = (
    book_orders.lazy()
    .join(amazon_books.lazy(), on="book_id", how="inner")
    # cast in its own step: expressions inside a single with_columns all read the input
    # columns, so the year/month extraction below needs the cast to already be applied
    .with_columns(pl.col("order_date").cast(pl.Datetime))
    .with_columns([
        (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
        pl.col("order_date").dt.year().alias("year"),
        pl.col("order_date").dt.month().alias("order_month"),
    ])
    .filter(pl.col("year") == 2022)
    .group_by("order_month")
    .agg(pl.col("sales").sum().alias("monthly_sales"))
    .sort("order_month")
)

# Step 2: Materialize into eager mode for rolling calculations
monthly_sales = monthly_sales_lazy.collect()
```
3. Calculating the Rolling Average and Wrapping Up
Treating the monthly sales figures as a NumPy array, we round using the half-up method, append the outcome into the Polars DataFrame, and extract the desired columns.
```python
# Step 3: Rolling average using round-half-up logic
sales_np = monthly_sales["monthly_sales"].to_numpy()
cumsum = np.cumsum(sales_np)
rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum) + 1) + 0.5).astype(int)

# Step 4: Inject back into Polars DataFrame
monthly_sales = monthly_sales.with_columns([
    pl.Series("rolling_average", rolling_avg)
])

# Step 5: Produce the final output with the right column names
result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])
```
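If you would rather keep the cumulative step inside Polars instead of dropping to NumPy, a single-pass expression built from cum_sum and cum_count gives the same expanding mean. This is a sketch that overwrites the rolling_average column on the collected monthly_sales frame; the round-half-up treatment from Step 3 would still be needed for an exact match to the expected output.

```python
# Polars-only alternative to Steps 3–4: expanding mean as cumulative sum / cumulative count
monthly_sales_alt = monthly_sales.with_columns(
    (pl.col("monthly_sales").cum_sum() / pl.col("monthly_sales").cum_count())
    .alias("rolling_average")
)
```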
// Performance Comparison

This question features two operations that carry the biggest performance impact: the join and the cumulative window. In Pandas, pd.merge combines every row across both tables before narrowing down to 2022. That means records spanning every year get processed well before any irrelevant rows are excluded. Polars, on the other hand, generates a lazy query plan and applies the filter(year == 2022) condition ahead of the join, so it works with a reduced dataset right from the start. The predicate pushdown takes place transparently, with zero additional effort on your part.
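If you want to verify where the filter lands, print the optimized plan before collecting (this assumes the monthly_sales_lazy frame from Step 1 is still in scope):

```python
# prints the optimized logical plan so you can see where the year == 2022
# filter sits relative to the join and the table scans
print(monthly_sales_lazy.explain())
```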
The most striking speed difference lies in the rolling average computation. Pandas' .expanding().mean() enlarges its window incrementally, handing off to C code on each step while still being governed by a Python-level loop. Polars' cum_mean() processes the entire column through a single Rust-based pass, free from any Python overhead. The gap may be trivial with monthly-level granularity, but should you execute the same logic against daily records over three years (around 1,000 rows), the Polars variant finishes in microseconds whereas Pandas presents noticeable delay caused by the expanding window mechanics.
Below is a preview of the code output:
| order_month | monthly_sales | rolling_average |
|---|---|---|
| 1 | 145 | 145 |
| 2 | 250 | 198 |
| 3 | 315 | 237 |
| … | … | … |
| 12 | 710 | 402 |
# Wrap-Up
Across all three problems, the Polars approaches share a common blueprint: construct a lazy query plan, funnel as much of the work as possible into the optimizer, and trigger .collect() only at the point where a materialized result is needed.
The syntax demands some recalibration if you, like the majority of analysts, have spent years building muscle memory with Pandas, but the core operations line up naturally. .groupby() maps to .group_by(), .rename() accepts a simple dictionary rather than a columns= keyword argument, and ranking boils down to a sort followed by .with_row_count().
The real distinction becomes clear at scale. With modest-sized datasets, both libraries deliver results swiftly enough that the gap is virtually invisible. Once row counts climb into the millions, Polars' parallelized Rust engine and single-pass algorithms pull decisively ahead. If Pandas performance bottlenecks are slowing you down, tackling these three challenges head-on offers a natural migration path.
Nate Rosidi is a data scientist specializing in product strategy. He also serves as an adjunct professor of analytics and founded StrataScratch, a platform that helps data professionals sharpen their skills with authentic interview questions sourced from leading companies. Nate covers emerging trends across the career landscape, offers interview preparation tips, showcases data science projects, and writes extensively about SQL.





