When I was learning data science back in 2020, Pandas stood out as one of the go-to tools. Even though newer libraries have emerged to address Pandas’ limitations with massive datasets, I still rely on it heavily for data cleaning, processing, and analysis. Sure, it struggles a bit when dealing with billions of rows, but for anything smaller, it handles the job perfectly.
I’ve noticed Pandas being used not just for exploratory data analysis or in Jupyter notebooks, but also in live production environments.
In this post, I’ll walk through some common data cleaning and processing tasks to show just how powerful Pandas really is.
First, let’s look at the dataset. It contains stock keeping units (SKUs) along with search API responses for each one.
import pandas as pd
search_results = pd.read_csv("search_results.csv")
search_results.head()The search result is stored as a list of dictionaries and looks something like this:
search_results.loc[0, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
... and 5 entities remaining"As you can see, it’s not in a clean list-of-dictionaries format because of that trailing part (“… and 5 entities remaining”). Plus, it’s stored as a plain string.
To make it usable, we need to turn it into a proper list of dictionaries. The following line trims off the unwanted ending by splitting the string at “…” and keeping only the first part.
search_results.loc[0, "search_result"].split("...")[0].strip()But the result is still just a string. To convert it into an actual list, we can use Python’s built-in `ast` module:
import ast
res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())
res
[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]Now we have a clean list of dictionaries—but only for one row. We need to apply this same transformation to every SKU in the dataset.
One approach would be to loop through each row with a `for` loop, but that’s inefficient. Instead, we should aim for vectorized operations, which process all rows at once.
The string-splitting method I used earlier doesn’t work well in a vectorized context. A more reliable solution is to use a regular expression (regex).
search_results.loc[:, 'search_result'] = search_results['search_result'].str.replace(r"....*", "", regex=True).str.strip()This line finds the “…” and everything after it, then replaces them with nothing—effectively removing the “… and 5 entities remaining” portion.
Now, every row in the `search_result` column contains a properly formatted list of dictionaries.
search_results.loc[10, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"They’re still stored as strings, but I can easily convert them into actual lists using the `ast` module—which I’ll do next.
What I really care about are the SKUs returned in the search results. I’ll create a new column by pulling out the SKUs from each dictionary using the `”my_id”` key.
This task involves three steps:
- Use `literal_eval` to convert the search result string into a list
- Grab the SKU from the `”my_id”` key in each dictionary
- Use a list comprehension to collect SKUs from all dictionaries in the list
We can accomplish all of this by applying a lambda function across every row like so:
search_results.loc[:, "result_skus"] =
search_results["search_result"].apply(lambda x: [item['my_id'] for item in ast.literal_eval(x)])
search_results.head()
Each entry in the `result_skus` column now holds a list of 10 SKUs. Suppose I want each of these 10 SKUs to appear in its own separate row. For every original SKU row, I’ll generate 10 new rows—one for each result SKU. Pandas makes this incredibly easy with the `explode` function.
data = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)
data.head()
We built a new dataframe containing the sku and result_skus columns. The illustration below shows how the explode function works:

Now let’s look at the reverse scenario. Suppose we have a dataframe like the one above, and we want all the results for each sku consolidated into a single row.
To achieve this, we can use the groupby function to group the rows by sku, then apply the list function on the result_skus column:
new_data = data.groupby("sku", as_index=False)["result_skus"].apply(list)
new_data.head()This brings us back to the earlier step:

With the explode function, we generated a dataframe where each sku in the result_skus column gets its own row. But what if we want those values spread across separate columns instead of separate rows?
One approach is to use the pd.Series function on the result_skus column and then merge the resulting columns with the original dataframe.
new_cols = new_data["result_skus"].apply(pd.Series)
new_data = pd.concat([new_data, new_cols], axis=1)
new_data.head()
Columns 0 through 9 hold the 10 SKUs from the result_skus column. Keep in mind that this code using the apply function is not a vectorized operation.
There’s an alternative approach that is vectorized and significantly faster.
new_cols = pd.DataFrame(new_data["result_skus"].tolist())
new_data = pd.concat([new_data, new_cols], axis=1)This code produces the same dataframe as the previous method but runs much faster.
I’ve walked through a common data cleaning and processing task that data scientists and analysts regularly face in their work. With over 5 years of experience in the field, Pandas has consistently met my needs, except when handling extremely large datasets (for example, billions of rows).
Tools better suited for such large-scale datasets share syntax similar to Pandas. For instance, PySpark blends elements of Pandas and SQL. Polars closely mirrors Pandas in its syntax. Therefore, learning and practicing Pandas remains an extremely valuable skill for anyone in the data science and AI field.
Thanks for reading.



