In this installment of my data engineering series, I laid out a 12-month plan for moving from data analyst to data engineer. Now, it’s time to start creating.
When I put out my initial piece about learning data engineering, the reaction surprised me. It struck a chord with readers. I received messages from people I didn’t know, eager to join the ride. It was encouraging.
It also brought a sense of obligation.
This was no longer a private ambition I could quietly drop if the going got tough. Others were paying attention. Others were facing similar challenges. That sense of shared commitment, frankly, is a big reason you’re seeing this article today.
So I got to work. And like anyone picking up a new discipline, my first move was to hunt for learning material. The internet is overflowing with data engineering content. Videos, online courses, blog tutorials. More than anyone could ever get through.
But I just couldn’t sit back and absorb theory alone. I needed hands-on work. A tangible project, using live data, that produced a real result when it was done.
So I shut the tutorials and launched a Google Colab notebook. I pulled up the GitHub API docs and committed to building my first ETL pipeline from the ground up. No safety net. Just Python, me, and a clear objective.
This article captures that whole experience. The scripting, the stumbles, the little breakthroughs, and the genuine lessons I picked up along the way.
First, a quick ETL primer
Before I walk through what I put together, let me briefly cover what ETL stands for, since I had to learn this myself fairly recently.
ETL is short for Extract, Transform, Load. It’s one of the core ideas in data engineering.
- Extract is the act of fetching data from somewhere. An API, a database, a website, a flat file. You’re pulling unprocessed information out of a source.
- Transform is the cleanup and restructuring phase. Filtering out bad entries, computing new fields, reshaping the data so it serves a purpose.
- Load is the step where you store the cleaned output somewhere. A database, a warehouse, or even a plain CSV.
That’s the whole picture. Those three operations, chained together, form what’s called a data pipeline. Everything beyond that in the data engineering world like Airflow, Spark, Databricks is just a more powerful way of doing the same three tasks at a larger scale.
I’m still early in my roadmap, so I kept things basic. Straightforward Python, no workflow orchestration yet. But the underlying problem is identical.
My project
I pulled data from the GitHub API, zeroing in on the most starred Python repositories launched in the previous 30 days. From there, I cleaned it up, tacked on an additional column, and exported everything as a CSV file.
Modest. Practical. Built entirely by me.
Here’s the full breakdown.
Step 1: Extract
My first challenge was understanding how to interact with the GitHub API. An API is essentially a gateway that a platform provides so developers can request data programmatically instead of manually copying and pasting.
GitHub offers a free, public API. No login or subscription required for basic queries.
This is the code I wrote to pull the data:
import requests
url = "https://api.github.com/search/repositories"
params = {
"q": "language:python created:>2025-04-22",
"sort": "stars",
"order": "desc",
"per_page": 30
}
response = requests.get(url, params=params)
data = response.json()
print(response.status_code)
print(data.keys())I’ll be upfront. This snippet threw me off initially. The requests library was unfamiliar. The params dictionary with that q query syntax looked foreign. I wasn’t immediately clear on what .json() was doing or why it was necessary.
Here’s a straightforward rundown.
requests.get()is how you reach out to GitHub and ask for something. Theurlis the endpoint you’re querying. Theparamsdictionary holds the specifics of your request. Here it means: “show me Python repos, ordered by star count, created after April 22, with 30 results.”.json()takes the raw response from GitHub and turns it into a Python dictionary you can manipulate.
When I executed it, this is what came back:
200
dict_keys(['total_count', 'incomplete_results', 'items'])The 200 signals a successful request. That is the web’s way of confirming “everything worked on our end.” A 403 or 404 would mean something went sideways.
The response dictionary contains three keys. total_count shows how many repositories matched the search criteria. incomplete_results flags whether GitHub truncated any results. And items holds the actual repository data.
I then ran another snippet to inspect what was inside:
print("Total matches on GitHub:", data['total_count'])
print("Repos returned:", len(data['items']))
first_repo = data['items'][0]
print("nFirst repo name:", first_repo['name'])
print("Stars:", first_repo['stargazers_count'])
print("Language:", first_repo['language'])
print("URL:", first_repo['html_url'])Output:
Total matches on GitHub: 9228201
Repos returned: 30
First repo name: skills
Stars: 139136
Language: Python
URL: https://github.com/anthropics/skillsThe top result was an Anthropic repository with 139,000 stars. Live data. Real data. Retrieved by code I had written myself.
That wraps up the Extract phase.
Step 2: Transform
At this point I had 30 repositories sitting in a Python list, each one a deeply nested dictionary packed with fields. The majority weren’t useful for my purposes. The Transform phase is where you take raw, unwieldy data and turn it into something neat and structured.
My first move was to isolate only the columns I needed and load everything into a Pandas dataframe:
import pandas as pd
repos = []
for repo in data['items']:
repos.append({
"name": repo['name'],
"owner": repo['owner']['login'],
"stars": repo['stargazers_count'],
"forks": repo['forks_count'],
"language": repo['language'],
"description": repo['description'],
"url": repo['html_url'],
"created_at": repo['created_at']
})
df = pd.DataFrame(repos)
df.head()Watching that dataframe render was a genuine “wow” moment. I went from a mess of raw JSON to a tidy, labeled table with just a few lines of code.
Then I carried out three transformations:
# Remove rows where description is missing
df_clean = df.dropna(subset=['description'])
# Add a viral flag for repos with over 50k stars
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')
# Sort by stars descending
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)
print("Before cleaning:", len(df))
print("After cleaning:", len(df_clean))Output:
Before cleaning: 30
After cleaning: 29One repository was missing a description, so it was removed. The new “viral” column appeared neatly. The dataset was now sorted and well-organized.
That wraps up the Transform step.
Step 3: Load
The final stage. Take the cleaned data and save it somewhere useful. I kept things simple by exporting it to a CSV file:
df_clean.to_csv('github_trending_repos.csv', index=False)
print("Pipeline complete. File saved.")
print(f"{len(df_clean)} repos loaded into github_trending_repos.csv")Output:
Pipeline complete. File saved.
29 repos loaded into github_trending_repos.csvI downloaded the file and opened it. A tidy spreadsheet containing 29 rows and 9 columns. Real GitHub data, organized and saved by a pipeline I built entirely on my own.
That wraps up the Load step.
What this actually felt like
Before this, whenever I needed data to work on, I’d go hunting for a public dataset someone else had already cleaned and uploaded. Kaggle, Google Dataset Search, or similar sources. I was always relying on data that someone else had prepared.
This shifted my perspective entirely.
The moment I realized I could simply point Python at an API I was curious about and pull live data myself, everything opened up. I’m no longer restricted to datasets that already exist. I can create my own pipeline to generate the dataset.
That’s a fundamentally different kind of capability. And it’s one of the reasons I was drawn toward data engineering in the first place.
What’s next
This pipeline is intentionally simple. I’m at the beginning of my roadmap, and I won’t pretend I’m using Airflow or Spark just yet. But the foundation is solid. Extract, Transform, Load. It works. I built it. I understand it.
The next step is to make it more reliable. Schedule it to run every day. Store the output in a SQLite database rather than a basic CSV file. Begin tracking how repositories trend over time.
Down the road, I’ll orchestrate the entire workflow with Airflow. But that’s a topic for a future article.
For now, the most important thing I’ve proven to myself is that building teaches you things that watching never could. I spent weeks stuck in tutorial mode and barely made progress. I spent one afternoon actually building something, and I grasp ETL better than any video ever made it seem.
Stop watching. Start building.
This is part two of my ongoing data engineering series. Stick around as I document every step of the journey, including the messy parts. Feel free to check out my more detailed ETL breakdown on my YouTube channel below.
Connect with me on LinkedIn, YouTube, and Twitter.



