I embarked on this journey partly because data engineering is one of the most in-demand and lucrative careers today. I won’t deny that this played a role in my decision.
But there’s more to the story.
I’ve been studying data analytics for quite some time now. SQL, Power BI, Python (Pandas, NumPy, a bit of Polars), data cleaning, exploratory data analysis — you name it, I’ve been deep in the trenches. And I truly enjoy it. But at some point, I became fascinated by what happens before the data reaches my desk. How does it travel? Who constructs those pipelines? What does the underlying infrastructure actually look like?
That curiosity took root.
Then AI began automating much of what I do, making it faster and simpler. Which is wonderful. But it also made me wonder: if AI can handle the analysis, what sets me apart? What can I build and comprehend that goes deeper? I work as an IT System Analyst at a startup, and while I like the work, I realized I wasn’t pushing myself the way I wanted to. I was ready for a bigger challenge.
The final nudge came from a video by Data With Baraa, where he presented a comprehensive data engineering roadmap. Something about seeing it organized and laid out made it feel tangible and achievable. So here I am.
I’m learning data engineering publicly. And this article marks the start of that journey.
Also, a quick disclaimer: I’m not affiliated with Data With Baraa. I’m simply sharing my personal experience. I hope it’s useful.
Why Data Engineering Specifically
I want to pause here because I think this question deserves a thoughtful answer.
Data analytics taught me how to work with data once it arrives. Clean it, explore it, visualize it, extract insights from it. That skill set is genuinely valuable. But the more I learned, the more I kept running into the same barrier. The data I was working with had already been transformed and delivered by someone else. Someone had built the pipeline that brought it to me. Someone had determined how it was stored, how it was organized, how frequently it updated.
I wanted to be that person.
Data engineering sits upstream from analytics. It’s about constructing the systems that make analysis possible from the start. Data pipelines, storage architecture, workflow orchestration, large-scale data processing — these are the foundations everything else rests on. And honestly, that kind of infrastructure work appeals to me in a way that pure analysis no longer does.
There’s also a practical consideration. Data engineering positions tend to rank among the better-paid roles in the data field. And as AI tools become more capable of automating the analytical layer, I expect the need for people who can build and maintain reliable data infrastructure to grow. I’d rather be constructing the pipelines than simply using them.
And one more thing — the startup where I work doesn’t use any of the tools I’m about to learn. Which means every hour I invest in this is completely self-directed. No team to learn from, no work projects to apply it to. Just me, the internet, and whatever I can build independently. That’s a challenge I’m choosing deliberately.
Why I’m Doing This in Public
Writing about what I learn is something I already believe in strongly. It forces you to truly grasp something before you can explain it. It keeps you accountable. And over time, it creates something that a resume alone never could.
But I’ll be upfront about my concerns too, because I think that’s the whole point of doing this publicly.
I struggle with shiny object syndrome. There, I admitted it. I’ve dabbled in graphic design, animation, writing, marketing, and IT before settling into data. There’s always something new and exciting tugging at my attention. Data engineering could easily be replaced by the next eye-catching thing in my feed if I’m not deliberate about it.
Consistency is another challenge. I work a 9-to-5 where I rarely touch the tools I’ll be studying. There’s no natural reinforcement at work, no colleague I can discuss Airflow questions with. I’m building this entirely on my own time, outside of my job responsibilities.
And balance. Three to four hours a day is the target. Some days that will feel effortless. Other days it will feel impossible.
Publishing this journey is my accountability mechanism. If I go silent, you’ll know I slipped. And I’d rather not slip.
What I’m Starting With
I’m not starting from scratch, which helps. I already have beginner-to-intermediate SQL from my data analytics background, Python fundamentals, and some hands-on experience with Pandas. That gives me a base to build on.
Here’s the complete learning stack, roughly in the order I’ll be tackling it.
1. SQL: Going Deeper Than Analytics
I know SQL. But analytics SQL and engineering SQL are different beasts. I’ll be diving deeper into query optimization, indexing, working with massive datasets, and writing SQL that’s built for performance rather than just exploration. If you’ve only ever used SQL to pull and filter data, there’s an entire layer beneath that’s worth exploring.
Why it’s first: Everything in data engineering eventually involves SQL. Sharpening this skill before adding more complex tools makes the rest of the journey smoother.
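To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are made up for illustration, and other databases report their query plans differently. The idea is only to show the same query going from a full table scan to an index lookup.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"


def plan(sql):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


plan_before = plan(query)  # no index yet: SQLite has to scan the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # with the index: SQLite can search by customer_id

total = conn.execute(query).fetchone()[0]
print(plan_before)
print(plan_after)
```

The query and its result do not change. Only the way the database finds the rows does, and that is the layer analytics work rarely has to look at.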
2. Python: From Exploratory to Production-Ready
I have the basics. Pandas, NumPy, some Polars. But the Python I’ve been writing lives mostly in notebooks. Exploratory, messy, not built to endure. The goal now is to write cleaner, more structured, reusable code. Functions, modules, error handling, scripting — the kind of Python you’d actually deploy in a pipeline.
Why it matters: Python is the glue that holds most modern data engineering stacks together. Airflow pipelines are defined in Python, and PySpark is the Python API for Spark (which itself runs on the JVM). Getting comfortable here is essential.
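As a rough illustration of the notebook-to-pipeline shift, the sketch below wraps a parsing step in a small function with explicit error handling and logging. The Sale record, the parse_sales function, and the sample rows are all hypothetical. This is one reasonable shape for such code, not a standard.

```python
import csv
import io
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Sale:
    """A single cleaned record (hypothetical schema)."""
    order_id: int
    amount: float


def parse_sales(lines):
    """Parse CSV rows into Sale records, setting malformed rows aside."""
    good, rejected = [], []
    for row in csv.DictReader(lines):
        try:
            good.append(Sale(int(row["order_id"]), float(row["amount"])))
        except (KeyError, TypeError, ValueError) as exc:
            # Log and keep going instead of crashing the whole run on one bad row.
            logger.warning("Skipping bad row %r: %s", row, exc)
            rejected.append(row)
    return good, rejected


# Made-up input: the second data row has an unparseable amount.
raw = io.StringIO("order_id,amount\n1,19.99\n2,not-a-number\n3,5\n")
sales, rejected = parse_sales(raw)
print(len(sales), "parsed,", len(rejected), "rejected")
```

In a notebook, the bad row would typically surface as a traceback halfway through a cell. Here it is logged, set aside, and the rest of the data still gets through.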
3. Git and GitHub: Version Control Done Right
I’ll be honest — my Git knowledge is currently “copy the command, hope it works.” That needs to change. Version control is fundamental to working like an engineer rather than just an analyst. I’ll be learning branching, pull requests, and how to manage code properly across projects.
Why it matters: Every project I build from here on goes on GitHub. It’s a portfolio, it’s discipline, and it’s how real teams operate.
4. Apache Spark and PySpark: Big Data Processing
This is where things get genuinely exciting. Apache Spark is one of the most widely used engines for processing large-scale data.
PySpark serves as the Python interface for Spark, allowing me to leverage a language I already have some familiarity with to process distributed data at scale.
Moving from Pandas to Spark requires a fundamental shift in thinking. Pandas loads data into memory on a single machine, while Spark splits data into partitions and processes them across a cluster. Adopting this distributed mindset is one of the key skills that distinguishes data engineers from data analysts.
Why it matters: When it comes to handling big data in a production setting, Spark is practically essential. It appears frequently in job listings and forms the backbone of the Databricks ecosystem I plan to build expertise in.
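The distributed mindset can be sketched without Spark at all. The plain-Python example below is not PySpark code, and the function names and sample records are invented. It only mimics the pattern of splitting data into partitions, computing a partial result on each one, and merging the partials at the end.

```python
from collections import Counter


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into partitions, like chunks sent to workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_totals(partition):
    """Aggregate one partition on its own, without seeing the others."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def merge_totals(partials):
    """Combine the per-partition results into the final answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)


# Made-up (region, amount) records.
records = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
partitions = split_into_partitions(records, 3)
result = merge_totals(partial_totals(p) for p in partitions)
print(result)
```

Everything here runs in one process. In a real cluster, each partition would sit on a different machine, which is why each partial step has to work without access to the full dataset.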
5. Apache Airflow: Orchestrating Data Pipelines
Data pipelines require management. You need a tool to schedule them, monitor their execution, and handle any failures smoothly. This is the role of workflow orchestration tools, and Airflow is my choice.
I evaluated several options. Databricks Workflows works well if you’re already invested in the Databricks environment. Azure Data Factory is a strong fit for organizations heavily using Azure. However, Airflow is free, open-source, works across any cloud provider, and is widely adopted in the industry. It also teaches foundational orchestration concepts that apply to other tools. Beginning with Airflow felt like the right decision, particularly since I’m aiming to minimize expenses.
Why it matters: Orchestration is what transforms a set of disconnected scripts into a cohesive pipeline. Grasping Airflow means understanding how real-world data workflows are managed in production.
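To show what an orchestrator does at its core, here is a toy sketch in plain Python. It is not Airflow and does not use Airflow's API. The run_pipeline function, the task names, and the retry policy are all assumptions made for illustration. It runs tasks in dependency order and retries a task that fails.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task up to max_retries times."""
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "success"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop so downstream tasks never run.
            raise RuntimeError(f"Task {name!r} failed after {max_retries + 1} attempts")
    return log


calls = {"transform": 0}


def extract():
    pass


def transform():
    # Fail on the first call only, to simulate a transient error.
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise ValueError("simulated transient failure")


def load():
    pass


tasks = {"extract": extract, "transform": transform, "load": load}
# Each key lists the tasks that must finish before it can start.
dependencies = {"transform": {"extract"}, "load": {"transform"}}
run_log = run_pipeline(tasks, dependencies)
print(run_log)
```

A real orchestrator adds scheduling, monitoring, and persistence on top of this. The dependency graph and the retry loop are the parts that carry over from one tool to another.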
6. Databricks: The Data Platform
Eventually, you need to commit to a data platform and develop deep expertise in it. I’ve chosen Databricks. It’s built on Spark, highly sought after in the job market, and offers a free Community Edition that lets you practice without spending on cloud credits.
The alternatives are strong as well. Snowflake is a sleek, high-performance cloud data warehouse that many companies prefer. BigQuery is Google’s fully managed, serverless data warehouse and is genuinely impressive if you’re leaning toward Google Cloud. But Databricks sits at the crossroads of big data, machine learning, and data engineering, which aligns well with my career direction. It was the most logical fit for my objectives.
Why it matters: Employers value hands-on platform experience. Developing deep expertise in one platform is more impactful than having surface-level knowledge of several.
How I’m Structuring the 12 Months
The truth is this might take longer than 12 months, and I’m fine with that. I’d rather spend 15 months and truly understand the material than rush through in 12 and have a shaky grasp of the fundamentals.
My general strategy is to work through each skill sequentially and not move forward until I’ve built something with what I’ve just learned. Tutorials are useful for getting oriented, but real learning happens through projects. My plan is to document each stage here on Towards Data Science: the concepts, the projects, the challenges, and the breakthroughs.
For tracking my progress, I’m using the Notion roadmap from Data With Baraa as my foundation. It breaks each skill into core topics and helps me monitor my progress without feeling overwhelmed by the full scope all at once.
As for time commitment, I’m aiming for three to four hours daily. Some of that will be structured learning, some will be hands-on building, and some will be writing about what I’ve learned, which is itself a powerful form of study.
What Success Looks Like
Securing a well-paying data engineering position is the goal. That’s the reality, and I won’t sugarcoat it.
But beyond that, I want to establish myself as a credible voice in this field. Someone who builds meaningful projects, shares the journey honestly including the difficult parts, and perhaps makes the path a bit clearer for those following behind me.
The writing and the learning reinforce each other. The portfolio becomes the evidence. The evidence builds the reputation. That’s the vision.
Starting Today
This article marks my official starting point. I’m not waiting until I feel fully prepared or until every detail is perfectly mapped out. I’m beginning now, writing as I progress, and keeping the process public and imperfect.
If you’re on a similar journey, whether you’re in analytics considering a move into engineering, in IT exploring your next step, or simply someone looking to build skills that retain their value in an AI-driven world, follow along.
I think we’ll have plenty to discuss. I’ll also be sharing my learnings on my YouTube channel, so feel free to subscribe below and join me.
This is the first article in an ongoing series documenting my data engineering journey. I’ll be publishing regularly about my progress, the projects I’m building, and everything I learn along the way.
And if you’d like access to the Notion template, in case you’re on the same journey, you can find it here.
Follow along on my journey on YouTube and Medium.