for a while now. Nothing too crazy though. Just basic data cleaning, exploratory data analysis, and a few essential functions. I’ve also explored things like method chaining for cleaner, more organized code, and operations that silently break your Pandas workflow, both of which I’ve written about before.
I never really thought about runtime. Honestly, if my code ran without errors and gave me the output I needed, I was happy. Even if it took a few minutes for all my notebook cells to finish, I didn’t care. No errors meant no problems, right?
Then I came across the concept of vectorization. And something clicked.
I went down the rabbit hole, as I usually do. The more I read, the more I realized that “no errors” and “efficient code” are two very different things. Your Pandas code can be completely correct and still be quietly terrible at scale.
So this article is me documenting what I learned: the mistakes that slow Pandas code down, why they happen, how to fix them, and when Pandas itself might be the bottleneck. If you’ve ever run a notebook and just assumed the wait time was normal, this one’s for you.
Why “Working Code” Isn’t Good Enough
There’s a reason this took me a while to think about. Pandas is designed to be forgiving. You can write code in a dozen different ways and most of them will work. You get your output, your dataframe looks right, and you move on.
But that flexibility comes with a hidden cost.
Unlike SQL or production-grade data systems, Pandas doesn’t force you to think about efficiency. It doesn’t warn you when you’re doing something expensive. It just… does it. Slowly, sometimes. But it does it.
Think about it this way. SQL has a query optimizer. It looks at what you’re asking for and figures out the most efficient way to get it. Pandas doesn’t have that. It trusts you to write efficient code. And if you don’t know what efficient looks like, you’ll never know you’re missing it.
The result is that a lot of Pandas code in the wild is what I’d call politely inefficient. It works on small datasets. It works on medium datasets with a little patience. But the moment you throw real-world data at it, something in the range of a few hundred thousand rows or more, the cracks start to show. What used to take seconds now takes minutes. What took minutes becomes unusable.
And the frustrating part is that nothing looks wrong. No errors. No warnings. Just a slow notebook and a spinning cursor.
That’s the trap. Pandas optimizes for convenience, not speed. And convenience is great, until it isn’t.
So the first shift is a mindset one: working code and efficient code are not the same thing. Once that clicks, everything else follows.
Profiling: Stop Guessing, Start Measuring
Here’s something I noticed while going down this rabbit hole. Most people, when they feel their code is slow, do one of two things. They either rewrite the whole thing from scratch hoping something improves, or they simply accept it and wait.
Neither of those is the right move.
The right move is to measure first. You can’t optimize what you haven’t identified. And more often than not, the part of your code you think is slow isn’t actually the problem.
Pandas gives you a few simple tools to start with.
%timeit — Know How Long Things Actually Take
%timeit is a Jupyter magic command that runs a line of code multiple times and gives you the average execution time. It’s the simplest way to compare two approaches and know, concretely, which one is faster.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'sales': np.random.randint(100, 10000, size=100_000),
    'discount': np.random.uniform(0.0, 0.5, size=100_000)
})

# Approach A
%timeit df.apply(lambda row: row['sales'] * row['discount'], axis=1)

# Approach B
%timeit df['sales'] * df['discount']
On a dataset of 100,000 rows, the difference is not subtle:
1.91 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
316 μs ± 14 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Same output. Completely different cost. That’s the kind of thing you’d never notice by just running the cell once and moving on.
df.info() and df.memory_usage() — Know What You’re Carrying
Speed isn’t just about computation. Memory plays a huge role too. A dataframe that’s bloated with the wrong data types will slow everything down before you’ve even written a single transformation.
df.info()
Output:
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   sales     100000 non-null  int32
 1   discount  100000 non-null  float64
dtypes: float64(1), int32(1)
memory usage: 1.1 MB
To check the memory usage directly:
df.memory_usage(deep=True)
Output:
Index         132
sales      400000
discount   800000
dtype: int64
Here, we can see that discount takes up twice the space of sales. That’s because discount is stored as a “heavier” number type (float64), while sales is stored in a “lighter” type (int32).
This becomes especially important when you’re working with string columns or object types that are secretly eating memory. We’ll come back to this in the next section.
The Profiling Mindset
The tools themselves are simple. The shift is in how you approach your code. Before you optimize anything, ask: where is the time actually going? Measure the slow parts. Compare alternatives. Let the numbers tell you what to fix.
Because what feels slow and what is slow are often two entirely different things.
Mistake #1: Row-wise Operations (The Silent Killer)
If there’s one thing I kept seeing come up over and over while researching this topic, it was this: people looping through Pandas dataframes row by row. And I get it. It feels natural. You think about your data one row at a time, so you write code that processes it one row at a time.
The problem is, that’s not how Pandas thinks.
How Pandas Actually Works
Pandas is built on top of NumPy, which stores data in contiguous blocks of memory, column by column. This means Pandas is heavily optimized to operate on whole columns at once. When you do that, it runs fast, low-level, vectorized operations under the hood.
When you loop through rows instead, you’re essentially bypassing all of that. You’re dropping down into pure Python, one row at a time, with all the overhead that comes with it. On a small dataset you’ll never notice. On a large one, you’ll be waiting a long time.
There are two patterns that show up constantly.
.iterrows()
# Calculating a discounted price row by row
discounted_prices = []
for index, row in df.iterrows():
    discounted_prices.append(row['sales'] * (1 - row['discount']))
df['discounted_price'] = discounted_prices
This works. It gives you the right answer. But on a dataframe with 100,000 rows, it’s painfully slow.
%timeit [row['sales'] * (1 - row['discount']) for index, row in df.iterrows()]
Output:
10.2 s ± 1.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
.apply(axis=1)
This one is sneakier because it looks more “Pandas-like.” But applying a function across axis=1 means applying it row by row, which is essentially the same problem.
%timeit df.apply(lambda row: row['sales'] * (1 - row['discount']), axis=1)
Output:
1.5 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Faster than .iterrows(), but still working row by row. Still slow.
The Fix: Vectorized Operations
Here’s the same calculation, done the way Pandas actually wants you to do it:
df['discounted_price'] = df['sales'] * (1 - df['discount'])
Let’s time it:
%timeit df['sales'] * (1 - df['discount'])
Output:
688 μs ± 236 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That’s it. One line. No loop. No lambda. And it’s roughly 14,800x faster than .iterrows() and 2,180x faster than .apply(axis=1).
What’s happening here is that Pandas passes the entire column to NumPy, which executes the operation at the C level across the whole array at once. No Python overhead. No row-by-row iteration. Just fast, low-level computation.
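If it helps to see that delegation explicitly, here is a rough sketch of what the one-liner boils down to, using .to_numpy() to peek at the underlying arrays. This is an illustration of the idea, not a description of Pandas internals:
sales = df['sales'].to_numpy()        # the raw NumPy array behind the Series
discount = df['discount'].to_numpy()
discounted = sales * (1 - discount)   # one array-level operation, no Python loop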
When .apply() Is Actually Fine
To be fair, .apply() isn’t always the villain. When you’re applying a function column-wise (axis=0, which is the default), it’s often perfectly reasonable. The issue is specifically axis=1, which forces row-by-row execution.
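For example, something like this hypothetical standardization step (made up purely for illustration) hands each whole column to the function, so it stays cheap:
# Column-wise apply: the function receives one entire column at a time
df[['sales', 'discount']].apply(lambda col: (col - col.mean()) / col.std())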
And sometimes your logic is genuinely complex enough that a clean vectorized expression isn’t obvious. In those cases, np.vectorize() or np.where() can give you something closer to vectorized performance while still letting you express conditional logic clearly.
# Instead of this
df['category'] = df.apply(
    lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1
)

# Do this
df['category'] = np.where(df['sales'] > 5000, 'high', 'low')
Timing the two approaches:
%timeit df.apply(lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1)
%timeit np.where(df['sales'] > 5000, 'high', 'low')
Output:
1.31 s ± 189 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.3 ms ± 180 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Same result. About 1,000x faster.
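When the condition doesn’t reduce to a single comparison, np.vectorize() is one option. A small sketch, with a made-up labeling rule just for illustration, and one caveat: np.vectorize() is a convenience wrapper around a Python loop, so it mostly buys readability over .apply(axis=1) rather than true C-level speed:
def label_order(sales, discount):
    # Arbitrary example rule: big ticket and barely discounted
    if sales > 5000 and discount < 0.1:
        return 'premium'
    return 'standard'

df['label'] = np.vectorize(label_order)(df['sales'], df['discount'])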
The Rule of Thumb
If you’re writing a loop over rows in Pandas, stop and ask yourself: can this be expressed as a column operation? Nine times out of ten, the answer is yes. And when it is, the performance difference is transformative.
If you’re looping through rows, you’re not using Pandas. You’re using Python with extra steps.
Mistake #2: Unnecessary Copies and Memory Bloat
Row-wise operations get a lot of attention when people talk about Pandas performance. Memory gets a lot less. Which is a shame, because from what I learned digging into this, bloated memory is just as responsible for slow notebooks as bad computation.
Here’s the thing. Pandas operations don’t always modify your dataframe in place. Many of them quietly create a brand-new copy of your data behind the scenes. Do that enough times, and you’re not just holding one dataframe in memory. You’re holding several, all at once, without realizing it.
The Hidden Cost of Chained Operations
Chained operations are a common culprit. They look clean and readable, but each step can generate an intermediate copy that sits in memory until garbage collection cleans it up.
# Each step here potentially creates a new copy
df2 = df[df['sales'] > 1000]
df3 = df2.dropna()
df4 = df3.reset_index(drop=True)
df5 = df4[['sales', 'discount']]
By the time you get to df5, you potentially have five versions of your data floating around in memory simultaneously. On a small dataset this is invisible. On a large one, this is how you run out of RAM.
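One way to ease this, sketched below: write the steps as a single chain so the intermediates are never bound to names like df2 through df4 and can be freed as soon as the next step runs. Pushing the column selection into the first step also means less data gets copied in the first place:
df_small = (
    df.loc[df['sales'] > 1000, ['sales', 'discount']]   # filter rows and columns in one step
      .dropna()
      .reset_index(drop=True)
)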
Temporary Columns That Stick Around
Another pattern that quietly eats memory is creating columns you only needed briefly.
df['gross_revenue'] = df['sales'] * df['quantity']
df['tax'] = df['gross_revenue'] * 0.075
df['net_revenue'] = df['gross_revenue'] - df['tax']
# But you only actually needed net_revenue
gross_revenue and tax are now permanent columns in your dataframe, taking up memory for the rest of your notebook even though they were just stepping stones.
The fix is simple. Either compute directly:
df['net_revenue'] = (df['sales'] * df['quantity']) * (1 - 0.075)
Or drop them as soon as you’re done:
df.drop(columns=['gross_revenue', 'tax'], inplace=True)
Wrong Data Types Are Quietly Expensive
This one surprised me when I came across it. By default, Pandas is quite generous with how much memory it assigns to each column. Integer columns get int64. Float columns get float64. String columns become the object type, which is one of the most memory-hungry types in Pandas.
Let’s see what that actually looks like:
df = pd.DataFrame({
    'order_id': np.random.randint(1000, 9999, size=100_000),
    'sales': np.random.randint(100, 10000, size=100_000),
    'discount': np.random.uniform(0.0, 0.5, size=100_000),
    'region': np.random.choice(['north', 'south', 'east', 'west'], size=100_000)
})

df.memory_usage(deep=True)
Output:
Index            132
order_id      400000
sales         400000
discount      800000
region       5350066
dtype: int64
That region column, which only has four possible values, is eating 5.3 MB as an object type. Convert it to a categorical and watch what happens:
df['region'] = df['region'].astype('category')
df.memory_usage(deep=True)
Output:
Index            132
order_id      400000
sales         400000
discount      800000
region        100386
dtype: int64
From 5.3 MB down to about 100 KB. For one column. The same logic applies to integer columns where you don’t need the full int64 range. If your values fit comfortably in int32 or even int16, downcasting saves real memory.
df['sales'] = df['sales'].astype('int32')
df['order_id'] = df['order_id'].astype('int32')
df.memory_usage(deep=True)
Output:
Index            128
order_id      400000
sales         400000
discount      800000
region        100563
dtype: int64
A few small type changes and your dataframe is already significantly lighter. And a lighter dataframe means faster operations across the board, because there’s simply less data to move around.
The Quick Memory Check Habit
Before you run any heavy transformation, it’s worth knowing what you’re working with:
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
It takes a second and tells you exactly how much memory your dataframe is consuming at that point. Make it a habit before and after major transformations and you’ll quickly develop an intuition for when something is heavier than it needs to be.
The Insight
Slow code isn’t always about computation. Sometimes your notebook is slow because it’s carrying far more data than it needs to, in formats that are far more expensive than necessary. Trimming memory isn’t glamorous work, but it compounds. A dataframe that’s lighter to store is faster to filter, faster to merge, faster to transform.
Memory and speed are not separate problems. They’re the same problem.
Mistake #3: Overusing Pandas for Everything
This one is a little different from the previous two. It’s not about a specific function or a bad habit. It’s about knowing the limits of your tool.
Pandas is genuinely great. For most data tasks, especially at the scale most people work at, it’s more than enough. But there’s a pattern of Pandas usage I kept seeing described while researching this: people reaching for Pandas by default, for everything, regardless of whether it’s the right fit.
And at a certain scale, that becomes a problem.
The Dataset
To make this concrete, I generated a synthetic e-commerce dataset with 1 million rows. Nothing exotic, just the kind of data you’d realistically encounter: orders, dates, regions, categories, sales figures, discounts, quantities and statuses.
import pandas as pd
import numpy as np

np.random.seed(42)
n = 1_000_000

regions = ['north', 'south', 'east', 'west']
categories = ['electronics', 'clothing', 'furniture', 'food', 'sports']
statuses = ['completed', 'returned', 'pending', 'cancelled']

df = pd.DataFrame({
    'order_id': np.arange(1000, 1000 + n),
    'order_date': pd.date_range(start='2022-01-01', periods=n, freq='1min'),
    'region': np.random.choice(regions, size=n),
    'category': np.random.choice(categories, size=n),
    'sales': np.random.randint(100, 10000, size=n),
    'quantity': np.random.randint(1, 20, size=n),
    'discount': np.round(np.random.uniform(0.0, 0.5, size=n), 2),
    'status': np.random.choice(statuses, size=n),
})

df.to_csv('large_sales_data.csv', index=False)
One million rows, saved to a CSV. This is the dataset we’ll be working with for the rest of the article.
Where Pandas Starts to Struggle
Pandas loads your entire dataset into memory. That’s fine when your data is a few hundred thousand rows. It starts to get uncomfortable at a few million. And beyond that, you’re fighting the tool.
The other scenario is complex, nested transformations, where you’re stacking multiple operations, creating intermediate results, and generally asking Pandas to do a lot of heavy lifting in sequence. Each step adds overhead. The costs stack up.
Here’s a practical example using our dataset. Say you need to calculate a rolling average of sales per region, flag orders above a threshold, then aggregate by month:
# Step 1: Sort
df = df.sort_values(['region', 'order_date'])

# Step 2: Rolling average per region
df['rolling_avg'] = (
    df.groupby('region')['sales']
    .transform(lambda x: x.rolling(window=7).mean())
)

# Step 3: Flag high-value orders
df['high_value'] = df['sales'] > df['rolling_avg'] * 1.5

# Step 4: Monthly aggregation
df['month'] = pd.to_datetime(df['order_date']).dt.to_period('M')
monthly_summary = df.groupby(['region', 'month'])['sales'].sum()
This works. But notice that Step 2 uses .transform(lambda x: ...), which drops into a Python-level lambda for every group and carries a cost similar to the row-wise patterns we talked about earlier. On 1 million rows, this pipeline will drag. Go ahead and time it on your machine and you’ll see exactly what I mean.
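For what it’s worth, that particular step has a cheaper spelling: the built-in groupby-rolling does the same computation without handing a lambda to Python for each group. A sketch, assuming the frame is already sorted by region and date as in Step 1:
df['rolling_avg'] = (
    df.groupby('region', sort=False)['sales']
      .rolling(window=7)
      .mean()
      .droplevel(0)   # drop the group level so the result aligns back to df's original index
)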
What to Reach For Instead
The good news is that you don’t have to abandon Pandas entirely. There are a few options depending on the situation.
Chunking
If your dataset is too large to load all at once, Pandas lets you process it in chunks. Instead of loading all 1 million rows into memory at once, you load and process a portion at a time:
chunk_size = 100_000
results = []

for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    chunk['discounted_price'] = chunk['sales'] * (1 - chunk['discount'])
    results.append(chunk.groupby('region')['discounted_price'].sum())

final_result = pd.concat(results).groupby(level=0).sum()
print(final_result)
Instead of asking Pandas to hold 1 million rows in memory simultaneously, you’re feeding it 100,000 rows at a time, processing each chunk, and assembling the results at the end. It’s not the most elegant pattern, but it lets you work with data that would otherwise crash your kernel.
When to Consider Other Tools
Sometimes the honest answer is that Pandas isn’t the right tool for the job. That isn’t a criticism, it’s just scope. A few are worth knowing about:
- Polars: A modern dataframe library built in Rust, designed for speed. It uses lazy evaluation, meaning it optimizes your entire query before executing it. For large datasets it can be dramatically faster than Pandas.
- Dask: Extends Pandas to work in parallel across multiple cores or even multiple machines. If you’re comfortable with Pandas syntax, Dask feels familiar.
- DuckDB: Lets you run SQL queries directly on your dataframes or CSV files with surprisingly fast performance. Great for aggregations and analytical queries on large files (see the short sketch after this list).
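As a taste of that last option, here is a minimal sketch of the regional total from the chunking example, done in DuckDB instead. It assumes the duckdb package is installed and queries the CSV directly without loading it into Pandas first:
import duckdb

regional_totals = duckdb.sql("""
    SELECT region, SUM(sales * (1 - discount)) AS discounted_sales
    FROM 'large_sales_data.csv'
    GROUP BY region
""").df()   # return the result as a Pandas dataframe
print(regional_totals)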
The point isn’t to abandon Pandas. For most everyday data work, it’s the right choice. The point is to recognize when you’ve hit its ceiling, and to know that there are good options on the other side of it.
The Real-World Refactor: From 61 Seconds to 0.33 Seconds
This is where everything we’ve covered stops being theoretical.
I took our 1-million-row e-commerce dataset and wrote the kind of Pandas code that feels completely normal. The kind of thing you’d write on a Tuesday afternoon without thinking twice.
Then I timed it.
The Slow Version
import time

df = pd.read_csv('large_sales_data.csv')

start = time.time()

# Row-wise revenue calculation
df['gross_revenue'] = df.apply(
    lambda row: row['sales'] * row['quantity'], axis=1
)
df['tax'] = df.apply(
    lambda row: row['gross_revenue'] * 0.075, axis=1
)
df['net_revenue'] = df.apply(
    lambda row: row['gross_revenue'] - row['tax'], axis=1
)

# Row-wise flagging
df['order_flag'] = df.apply(
    lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
)

# Final aggregation
result = df.groupby('region')['net_revenue'].sum()

end = time.time()
print(f"Total runtime: {end - start:.2f} seconds")
Output:
Total runtime: 61.78 seconds
Over a minute. For a four-step pipeline. And nothing looks wrong. Let’s break down exactly what’s making it slow.
Three mistakes, all in one pipeline:
- First, the data types are never addressed. The region, category and status columns load as generic object types, which are memory-hungry and slow to work with. We’re carrying that dead weight through every single operation.
- Second, there are three separate .apply(axis=1) calls just to calculate revenue. Each one loops through all 1 million rows in Python, one at a time. We already saw in Mistake #1 how expensive that is. Here we’re doing it three times in a row.
- Third, gross_revenue and tax are created as permanent columns even though they’re just intermediate steps. They serve no purpose beyond being stepping stones to net_revenue, but they sit in memory for the rest of the pipeline anyway.
Here’s how I’d fix it, step by step.
Step 1: Fix data types upfront
Before anything else, convert the obvious categorical columns:
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')
This alone reduces memory usage significantly and makes subsequent operations cheaper across the board.
Step 2: Replace .apply() with vectorized operations
Instead of three separate row-wise calls, one vectorized expression does the same work:
# Before: three .apply() calls, three passes through 1 million rows
df['gross_revenue'] = df.apply(lambda row: row['sales'] * row['quantity'], axis=1)
df['tax'] = df.apply(lambda row: row['gross_revenue'] * 0.075, axis=1)
df['net_revenue'] = df.apply(lambda row: row['gross_revenue'] - row['tax'], axis=1)

# After: one vectorized expression, no temporary columns
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)
Step 3: Replace row-wise flagging with np.where()
# Before
df['order_flag'] = df.apply(
    lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
)

# After
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')
Same logic. Vectorized. Done.
The Fast Version
Put it all together and the pipeline looks like this:
import time

df = pd.read_csv('large_sales_data.csv')

start = time.time()

# Fix 1: Correct data types upfront
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')

# Fix 2: Vectorized revenue calculation, no temporary columns
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)

# Fix 3: Vectorized flagging with np.where
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')

# Final aggregation
result = df.groupby('region')['net_revenue'].sum()

end = time.time()
print(f"Total runtime: {end - start:.2f} seconds")
Output:
Total runtime: 0.33 seconds
61.78 seconds down to 0.33 seconds. A 99.5% reduction in runtime. That’s roughly 187x faster.
It’s not a trick. That’s just how Pandas is meant to be used.
Before You Run Your Next Notebook
Everything we covered comes down to a few core habits. Not rules. Not tricks. Just a different way of thinking about your code before you write it.
- Think in columns, not rows. If you’re looping through a dataframe row by row, stop and ask whether the same thing can be expressed as a column operation. Nine times out of ten, it can.
- Measure before you optimize. Don’t guess where the slowness is coming from. Use %timeit and df.memory_usage() to let the numbers tell you what to fix.
- Watch your memory, not just your speed. Wrong data types, unnecessary copies and temporary columns all add up. A lighter dataframe is a faster dataframe.
- Know when to switch tools. Pandas is the right choice most of the time. But at a certain scale, the right optimization is recognizing that you’ve outgrown it.
I started down this rabbit hole because I kept seeing the same conversation come up in data communities. People frustrated with slow notebooks, code that worked fine on small data and fell apart on real data. I wanted to understand why.
What I found was that the code wasn’t broken. It just wasn’t built to scale. And the gap between code that works and code that works well isn’t about being an advanced Pandas user. It’s about a handful of habits applied consistently.
If you’ve ever waited too long for a notebook to finish and just assumed that was normal, now you know it doesn’t have to be.
If this changed how you think about your Pandas code, I’d love to hear what bottlenecks you’ve been dealing with. Feel free to say hi on any of these platforms:
Medium
YouTube