I Still Reach For Pandas: Why It Remains My Go-To For Data Wrangling

When I was learning data science back in 2020, Pandas stood out as one of the go-to tools. Even though newer libraries have emerged to address Pandas’ limitations with massive datasets, I still rely on it heavily for data cleaning, processing, and analysis. It struggles with billions of rows, but for smaller datasets that fit comfortably in memory, it handles the job well.

I’ve noticed Pandas being used not just for exploratory data analysis or in Jupyter notebooks, but also in live production environments.

In this post, I’ll walk through some common data cleaning and processing tasks to show just how powerful Pandas really is.

First, let’s look at the dataset. It contains stock keeping units (SKUs) along with search API responses for each one.

import pandas as pd

search_results = pd.read_csv("search_results.csv")

search_results.head()
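The CSV itself isn't included with this post, so here is a minimal stand-in for following along. It assumes the file has the two columns used later on, `sku` and `search_result`. The SKU labels (`SKU-A`, `SKU-B`) and the shortened result string are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for search_results.csv: same two columns, two rows.
# The values in "sku" are invented; the result string is a shortened sample.
raw_result = (
    "[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, "
    "{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}}] "
    "... and 5 entities remaining"
)
search_results = pd.DataFrame(
    {"sku": ["SKU-A", "SKU-B"], "search_result": [raw_result, raw_result]}
)

# Each cell in "search_result" is a plain Python string, not a list.
print(type(search_results.loc[0, "search_result"]))
```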

Each search result is a string that represents a list of dictionaries and looks something like this:

search_results.loc[0, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}}, 
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}}, 
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}}, 
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}}, 
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}}, 
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}}, 
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}] 
... and 5 entities remaining"

As you can see, it’s not in a clean list-of-dictionaries format because of that trailing part (“… and 5 entities remaining”). Plus, it’s stored as a plain string.

To make it usable, we need to turn it into a proper list of dictionaries. The following line trims off the unwanted ending by splitting the string at “…” and keeping only the first part.

search_results.loc[0, "search_result"].split("...")[0].strip()

But the result is still just a string. To convert it into an actual list, we can use Python’s built-in `ast` module:

import ast

res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())

res

[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
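A quick sketch of why `ast.literal_eval` fits here better than `json.loads`: the string uses single quotes, which is valid Python literal syntax but not valid JSON. The sample string below is a shortened, made-up version of the real one.

```python
import ast
import json

raw = (
    "[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}] "
    "... and 5 entities remaining"
)
trimmed = raw.split("...")[0].strip()

# JSON requires double-quoted strings, so json.loads rejects this input.
try:
    json.loads(trimmed)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval parses Python literals (lists, dicts, strings, numbers)
# without executing arbitrary code.
parsed = ast.literal_eval(trimmed)

print(json_ok)
print(parsed[0]["my_id"])
```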

Now we have a clean list of dictionaries—but only for one row. We need to apply this same transformation to every SKU in the dataset.

One approach would be to loop through each row with a `for` loop, but that’s inefficient. Instead, we should aim for vectorized operations, which process all rows at once.
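As a rough illustration on a made-up Series, here is a row-by-row loop next to the column-wise `.str` accessor. Both produce the same values. The column-wise form is shorter and passes missing values through without an explicit check. I'm not making any timing claims here.

```python
import pandas as pd

raw = pd.Series(["  a ", "b  ", None])

# Row-by-row: an explicit loop, where we guard against missing values ourselves.
looped = []
for value in raw:
    looped.append(value.strip() if isinstance(value, str) else value)
looped = pd.Series(looped)

# Column-wise: the .str accessor applies strip() to every element
# and leaves missing values as missing.
columnwise = raw.str.strip()

print(columnwise.tolist())
```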

The `split` call I used earlier works on a single Python string. For a whole column, Pandas exposes string methods through the `.str` accessor, and a regular expression (regex) is a reliable way to strip the trailing text.

search_results.loc[:, 'search_result'] = search_results['search_result'].str.replace(r"\.\.\..*", "", regex=True).str.strip()

This line finds the “…” and everything after it, then replaces them with nothing—effectively removing the “… and 5 entities remaining” portion.
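One detail worth checking on a made-up string: in a regex, an unescaped `.` matches any character, so the three dots must be escaped with backslashes to match a literal "...". The sketch below compares both patterns.

```python
import pandas as pd

s = pd.Series(["[{'my_id': 'A1', 'distance': 0.9}] ... and 5 entities remaining"])

# Escaped dots: match a literal "..." and everything after it.
escaped = s.str.replace(r"\.\.\..*", "", regex=True).str.strip()

# Unescaped dots: "." matches any character, so the pattern matches
# from the very first character and the whole string is removed.
unescaped = s.str.replace(r"....*", "", regex=True).str.strip()

print(repr(escaped[0]))
print(repr(unescaped[0]))
```

The single dot inside `0.9` is left alone by the escaped pattern because it requires three consecutive dots.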

Now, every row in the `search_result` column contains a properly formatted list of dictionaries.

search_results.loc[10, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"

They’re still stored as strings, but I can easily convert them into actual lists using the `ast` module—which I’ll do next.

What I really care about are the SKUs returned in the search results. I’ll create a new column by pulling out the SKUs from each dictionary using the `"my_id"` key.

This task involves three steps:

1. Use `literal_eval` to convert the search result string into a list
2. Grab the SKU from the `"my_id"` key in each dictionary
3. Use a list comprehension to collect SKUs from all dictionaries in the list

We can accomplish all of this by applying a lambda function across every row like so:

search_results.loc[:, "result_skus"] = search_results["search_result"].apply(
    lambda x: [item["my_id"] for item in ast.literal_eval(x)]
)

search_results.head()
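To try the three steps in isolation, here is a minimal sketch on a two-row toy DataFrame. The helper name `extract_ids` and the toy values are hypothetical. It is the same logic as the lambda, written as a named function so it is easier to test.

```python
import ast

import pandas as pd


def extract_ids(raw):
    """Parse one stringified list of dicts and return its 'my_id' values."""
    return [item["my_id"] for item in ast.literal_eval(raw)]


# Toy data: invented SKUs and already-trimmed result strings.
toy = pd.DataFrame(
    {
        "sku": ["SKU-A", "SKU-B"],
        "search_result": [
            "[{'my_id': 'X1', 'distance': 1.0}, {'my_id': 'X2', 'distance': 0.9}]",
            "[{'my_id': 'Y1', 'distance': 1.0}]",
        ],
    }
)

toy["result_skus"] = toy["search_result"].apply(extract_ids)
print(toy["result_skus"].tolist())
```

Keep in mind that `apply` still calls the function once per row, so it is concise rather than truly vectorized.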

Each entry in the `result_skus` column now holds a list of 10 SKUs. Suppose I want each of these 10 SKUs to appear in its own separate row. For every original SKU row, I’ll generate 10 new rows, one for each result SKU. Pandas makes this easy with the `explode` method, which turns each list element into its own row and repeats the values of the other columns.

data = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)

data.head()
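Here is a small sketch of what `explode` does, using a toy DataFrame with invented values. With `ignore_index=True`, the resulting rows get a fresh 0..n-1 index instead of repeating the original row labels.

```python
import pandas as pd

# Toy data: two SKUs, each with a list of (made-up) result SKUs.
toy = pd.DataFrame(
    {"sku": ["SKU-A", "SKU-B"], "result_skus": [["X1", "X2"], ["Y1"]]}
)

# One output row per list element; "sku" is repeated for each element.
exploded = toy.explode("result_skus", ignore_index=True)
print(exploded)
```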
