Scenario: You’re part of the operations team at a mid-sized company. Each day, your team handles order forms from various B2B clients. These forms all come as PDFs, and in principle, they include the same core details: customer ID, purchase order number, delivery date, and the items ordered.
In reality, though, no two documents look quite alike. One client might place the purchase order number in the top-left corner, while another puts it in the bottom-right. Some label it “PO Number,” while others use “Order ID,” “Order Reference,” or something entirely different.
For us as humans, this isn’t usually an issue. We glance at the document, grasp the context, and instantly identify what each piece of information represents.
For conventional automation systems, though, this poses a real challenge: A regex rule can be written to specifically look for “PO Number: “. But what if the next client uses “Order Reference: “ instead?
That’s precisely the challenge I set up for this article.
We’ll compare two different methods for pulling structured data from B2B order forms:
- A traditional rule-based method using pytesseract and regex rules
- An LLM-driven method using pytesseract, Ollama, and LLaMA 3
The aim of this article isn’t to prove that LLMs are universally superior. They aren’t, in every case.
A far more compelling question is: At what stage do traditional extraction pipelines begin to hit their limits as complexity and the variety of layouts grow? And when can an LLM genuinely cut down on maintenance effort?
Table of Contents
1 – Step-by-Step Guide
2 – Head-to-Head Comparison
3 – When should we NOT use an LLM?
4 – Final Thoughts
Where to Continue Learning?
1 – Step-by-Step Guide
We’ll rebuild both methods from the ground up. First, we’ll generate two sample PDFs that contain identical business information but are formatted differently. Then, we’ll extract the data once using a traditional OCR-and-regex pipeline and once using an OCR-and-LLM pipeline. This lets us evaluate both methods under the same conditions.
- The traditional method essentially asks:
“Can I locate the exact pattern I programmed?” - The LLM-based method instead asks:
“Can I understand what this field means in context?”
→ 🤓 Find the full code in the GitHub Repo 🤓 ←
Before We Begin — Mise en Place
pip vs. Anaconda
In this guide, we use pip, Python’s default package manager. This means we install all libraries straight from the command line using pip install …. pip comes bundled automatically when you install Python. If you’ve seen Python tutorials that use Anaconda, that’s simply an alternative route to the same destination (using conda install …). In the article “Python Data Analysis Ecosystem — A Beginner’s Roadmap”, you’ll find more details on getting started with Python. Also, on a Windows machine, we use the CMD terminal (Windows key + R > type cmd).
Create and activate a new virtual environment
Set up a fresh Python environment with python –m venv b2bdocumentextractor (feel free to rename it) in a terminal, then activate it with b2bdocumentextractorScriptsactivate.
Optional: Verify Python and pip
python --version
pip --versionYou should see both a Python version and a pip version displayed.
Step 1 – Install Tesseract
Tesseract is the OCR engine. It’s the tool that actually reads text from images or scanned PDFs using OCR (Optical Character Recognition). pytesseract is merely the Python interface to Tesseract. In other words: Our Python code talks to Tesseract through pytesseract, but the actual text recognition is handled by Tesseract itself. Without installing Tesseract first, pytesseract won’t function.
First, download the latest .exe file for w64 and run the installer:
GitHub – Tesseract at UB Mannheim
Important: Take note of the installation path:
C:Program FilesTesseract-OCRIn the CMD terminal, confirm the installation with this command:
"C:Program FilesTesseract-OCRtesseract.exe" --versionIf everything went smoothly, you should see the Tesseract version number.
Step 2 – Install Poppler
Next, we install pdf2image. This is our library for turning PDFs into images, and it depends on Poppler behind the scenes. Poppler is an open-source PDF rendering library used for displaying PDF files.
To do this, download the latest version of Poppler, extract the ZIP file, and move the extracted folder to the C: drive.
GitHub-Poppler Windows Releases
Inside the folder, navigate to Library > bin and note the path where you placed the folder on your C: drive. On mine, it looks like this:
C:Usersschuepoppler-26.02.0LibrarybinNext, add this path to the PATH variable so Windows knows where to find Poppler.
Tip for Beginners:
Press the Windows key and search for Edit environment variables. Then click on Edit the system environment variables. Next, click on Environment Variables. Under User variables, select the PATH variable, click Edit, then New, and paste the path.
Now restart CMD so the changes take effect.

Step 3 – Install Python Libraries
Now we install all the Python libraries we need. Make sure you reactivate the Python environment first:
- pytesseract:
- pytesseract: This library acts as a connector between Python and Tesseract. While Tesseract serves as the OCR engine, it is only through pytesseract that Python can interact with it seamlessly.
- pdf2image: As an OCR engine, pytesseract identifies text from image pixels but is unable to interpret PDF files directly. That’s where pdf2image comes in—it converts each PDF page into an image, much like taking a screenshot, enabling pytesseract to process it afterward. Keep in mind: if the PDFs are digital (with selectable and copyable text), tools like pdfplumber or PyMuPDF could extract text directly. However, since B2B order forms are frequently scanned documents, we rely on pdf2image as an intermediary step.
- pillow: Both pdf2image and pytesseract depend on this image-processing library behind the scenes (though its usage isn’t visible in the code) to handle images properly.
fpdf2: This library is used to automatically generate two sample PDFs (Layout A and Layout B) for the article’s example via a script.
ollama: This library enables our Python script to communicate with the LLM and retrieve its responses.

Step 4 – Install Ollama and Download LLaMA 3
Once the library installation is complete, we proceed to set up Ollama and LLaMA 3 as the LLM. Ollama is a tool that allows you to run LLMs locally on your laptop at no cost and without requiring API keys.
First, install Ollama. If you haven’t done so yet, download the Windows installer from the Ollama website and run it.
Next, download LLaMA 3 by entering this command:
ollama pull llama3This process might take a while depending on your internet speed, as it involves downloading roughly 4.7 GB. A progress bar in the terminal will show the download status.

After the download is complete, confirm that everything is set up correctly:
ollama listIf you see output similar to the screenshot, the installation was successful.

Step 5 – Create the Project Folder and Generate Test PDFs
For this test, we generate two B2B order forms—one for Alpha GmbH and one for Beta AG—containing identical data but with different layouts. In this scenario, we assume the forms are scanned, which explains our earlier setup of pdf2image (for digital PDFs, libraries such as pdfplumber or PyMuPDF could be used instead).
First, create a project directory to organize all the files:
mkdir document_extractor
cd document_extractorThen, create a new file named create_test_pdfs.py and paste the code available in this GitHub Gist. Save this file in the document_extractor folder you just created.
Now go back to the terminal and run the file:
python create_test_pdfs.pyInside the folder, you should now see the two newly generated PDFs:

Looking at the two PDFs, you can immediately spot the issue:
- They contain the same data.
- However, they use entirely different field labels and date formats.
Approach 1: The Traditional Method (pytesseract + Regex Rules)
The conventional method involves two steps:
- First, the PDF is converted into an image, and pytesseract applies OCR (Optical Character Recognition) to extract the raw text by analyzing the pixels. Essentially, OCR works by detecting characters from images, similar to how a person reads handwriting.
- In the next step, regex (regular expressions) is applied to locate specific patterns within the text. For instance, you could set it up to extract everything following
PO Number:.
Right away in this step, a challenge becomes apparent: what if a customer uses “Order Reference” instead of “PO Number:”?
In such a case, the regex pattern would fail to find a match. The solution is to create a new rule to handle this variation.
Run Script 1 for Approach 1
Next, create a file called approach1_traditional.py and add the code from the GitHub Gist in the same folder:
Now execute the file in the terminal:
python approach1_traditional.pyOutcome of Approach 1
With Layout A, the method works perfectly:
However, for Layout B, none of the fields are identified, and all values show “None”:

This highlights the core limitation: for every new customer, unique regex rules must be written, tested, and deployed. With 200 customers, that translates to managing 200 distinct patterns. And whenever a customer slightly updates their form format, the entire system fails.
Approach 2: ANew Method (pytesseract + Ollama + LLaMA 3)
In this updated method, we continue using OCR but swap out the strict regex rules for a large language model (LLM):
- pytesseract continues to extract text from the PDF.
- Rather than instructing the code to “Look for PO Number:”, we prompt the LLM: “Here is a purchase order. Pull out these fields for me, no matter what they’re labeled.”
The LLM grasps the meaning behind the text. It can tell that “Order Reference” and “PO Number” refer to the same thing, even without a specific rule telling it so.
Run the Script for Method 2
Next, we create a new file named approach2_llm.py containing the code available in the GitHub Gist within the same directory:
We then run the file again from the terminal. Ensure Ollama is still active in the background:
python approach2_llm.pyOutcome of Method 2
What we observe now is that both document layouts are accurately interpreted:

For both layouts, data from fields with different labels is accurately pulled and mapped — all without tweaking a single regex pattern or building a new template. The LLM handles both layouts because it interprets the surrounding context. On top of that, the date format from Layout B is automatically converted to match Layout A’s format.
2 – Side-by-Side Comparison
After running both tests, one thing becomes immediately obvious: technically, both methods address the same challenge.
Each method comes with its own strengths and weaknesses:

With regex-driven pipelines, the complexity sits in the rules and upkeep. With LLM-driven pipelines, the complexity moves to infrastructure, processing time, and model behavior. For mid-sized businesses handling a wide variety of customer-specific document layouts, that trade-off can matter more strategically than raw extraction accuracy alone.
3 – When Should You Skip the LLM?
Lately, it can feel like every automation workflow needs to be swapped out for AI or an LLM.
In reality, though, that’s not always the smarter move. Mid-sized businesses especially don’t always need the “most cutting-edge” solution — they need one that stays reliable, easy to maintain, and cost-effective over time. Sometimes that means sticking with a traditional regex-based setup; other times, moving to an LLM is the better call.
Here are some scenarios where the traditional approach may still be the better fit:
- The documents are consistent and uniform:
If a business only deals with a handful of known layouts that seldom change, regex is often the stronger choice.Why?
Because the added value of an LLM is minimal, while the overall system becomes more complex.
A stable rule-based workflow, by contrast, is quicker, less expensive, simpler to troubleshoot, and easier to onboard new team members onto.
- Speed and volume are top priorities:
In our example, the LLM handles one document in about 20–40 seconds.That might sound fine at first. But picture a real production environment and the picture shifts quickly.
A mid-sized company likely processes orders, shipping notes, invoices, customs forms, support documents, and more — not 10 times a day, but 10,000 times a day.
At that scale, processing time becomes a genuine infrastructure concern. Regex-based systems run much faster, while LLMs demand more memory, more CPU/GPU capacity, and often extra queuing or batch-processing setups.
- Transparency matters more than adaptability:
In regulated fields like pharmaceuticals, insurance, banking, or healthcare, there’s often a need to fully explain why a particular value was extracted.Regex rules are fully deterministic: one line of code yields one clearly traceable result. LLMs, however, operate probabilistically — the model interprets the context and returns the most probable answer. That’s precisely what makes LLMs flexible, but it also makes them harder to audit.
- The business lacks the right infrastructure:
In our example, we used Ollama. Getting it up and running was fairly straightforward. Still, it’s important not to overlook that memory usage, GPU demands, monitoring, and response times under heavy loads can look very different in a real LLM deployment.
On my Substack Data Science Espresso, I share hands-on guides and quick updates from the world of Data Science, Python, AI, Machine Learning, and Tech — designed for curious minds like yours.
Check it out and subscribe on Medium or Substack if you’d like to stay updated.
4 – Closing Thoughts
Picking the right approach isn’t purely a technical decision — it’s a strategic one.
The traditional method attempts to explicitly define every possible document variation. The LLM-based method instead focuses on understanding meaning and context. For small, stable setups, the traditional approach is often entirely adequate. As more layouts and edge cases emerge, keeping the rules maintainable gets harder. That’s exactly the point where LLMs start to make sense.
This can also be a great first project for a company looking to dip its toes into LLMs — building AI readiness and gaining hands-on experience in the process.



