When PyMuPDF Misses The Table: Unlocking PDF Parsing For RAG With Azure Layout


This companion piece is part of the Enterprise Document Intelligence series, which walks through building an enterprise RAG system using four core components. Article 5 (document parsing) introduced a parser built with PyMuPDF (fitz). This companion follows the same objective and uses the same relational database tables, but replaces the parsing engine with Azure Layout (the prebuilt-layout model), a more powerful solution that captures what fitz misses. That gap is exactly where we begin.

*This companion’s position: it builds on Article 5 (document parsing), within Part II (the four core components), using a different parsing engine – Image by author*

PyMuPDF (fitz) is quick, free, and highly accurate on straightforward text. However, it fails in three specific areas, and each of these is where enterprise RAG systems quietly fall apart.

Consider a table on page 14 of a contract. Fitz processes each cell individually and joins them together, losing the column structure entirely. A result like “Renewal fee 500 Setup fee 200” ends up in a single chunk, forcing your model to guess which number corresponds to which fee.

Think about a scanned amendment attached at the end of a document. Fitz reads the native pages but returns blank strings for the scanned ones. The user receives no response about the amendment because the parser never actually processed it.

Or take a figure containing embedded text, such as a chart with axis labels, a signed seal stamp, or a screenshot of a spreadsheet. Fitz only returns the bounding box of the image, and all the text within it is lost.

Azure Document Intelligence handles all three scenarios. It is a proprietary Microsoft Azure cloud service governed by Microsoft’s Online Services Terms. The prebuilt-layout model delivers native table cells (with rows, columns, and headers), OCR text for every page (whether native or scanned), figures with their internal text extracted, and paragraph roles (title, sectionHeading, figureCaption, tableCaption). All of this comes from a single API call. You get the same relational tables as with fitz, but roughly half of them are significantly enriched.

The downstream pipeline does not need to know which engine generated the data. Retrieval, generation, and annotation all work with rows from the database. They never interact with the PDF directly.

*The same tables, with Azure enriching half of them – Image by author*

1. Where fitz falls short

Four scenarios. In each one, fitz comes up short while Azure delivers.

1.1. Tables: fitz gives flat words, Azure gives structured cells

A contract table is organized into rows and columns. The label “Renewal fee” belongs in column 1, and the value 500 belongs in column 2. Fitz scans the page from top to bottom and outputs one line per text segment. The four cells in a single row appear as four disconnected words. Sometimes cells from the row below get mixed in if their y-coordinates are similar. The downstream chunker sees an unstructured jumble of words. The row-and-column layout that defines a table is completely lost.

Azure’s prebuilt-layout model identifies each table as a structured object. result.tables is a list of tables, each containing cells indexed by (row_index, column_index). The header row is clearly marked (cell.kind == "columnHeader"). Each cell’s content is the exact text the author entered. We convert the table into markdown rows so it fits into line_df alongside other content. A four-cell row like “Renewal fee | 500 | Setup fee | 200” becomes a single line_df row with that markdown text. The header row receives a | --- | --- | ... | separator so that a downstream model can reconstruct the structure.
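The cell-to-markdown conversion described above can be sketched as follows. This is a minimal illustration, not the series' actual builder: the `Cell` dataclass is a hypothetical stand-in that carries only the fields named in the text (row_index, column_index, content, kind), and `table_to_markdown_rows` is an assumed name.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for an Azure table cell; only the fields used below.
    row_index: int
    column_index: int
    content: str
    kind: str = "content"


def table_to_markdown_rows(cells, column_count):
    """Return one markdown line per table row, plus a separator after each header row."""
    rows = {}
    header_rows = set()
    for cell in cells:
        # Place each cell's text at its (row, column) position.
        rows.setdefault(cell.row_index, [""] * column_count)[cell.column_index] = cell.content
        if cell.kind == "columnHeader":
            header_rows.add(cell.row_index)

    lines = []
    for row_index in sorted(rows):
        lines.append("| " + " | ".join(rows[row_index]) + " |")
        if row_index in header_rows:
            lines.append("| " + " | ".join(["---"] * column_count) + " |")
    return lines


cells = [
    Cell(0, 0, "Item", "columnHeader"), Cell(0, 1, "Amount", "columnHeader"),
    Cell(1, 0, "Renewal fee"), Cell(1, 1, "500"),
    Cell(2, 0, "Setup fee"), Cell(2, 1, "200"),
]
markdown_rows = table_to_markdown_rows(cells, column_count=2)
for line in markdown_rows:
    print(line)
```

Each returned string would become one line_df row. A table with several header rows would get a separator after each of them in this sketch, which a real builder would probably want to handle differently.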

1.2. Images: fitz gives the bounding box, Azure gives the text

Many PDFs contain figures with embedded text, such as architecture diagrams with labeled boxes, charts with axis ticks and legends, signed seal stamps, or embedded spreadsheet screenshots. Fitz returns each image as a bounding box along with the raw bytes, but the text inside remains invisible to the parser.

Azure’s OCR processes every page, including the pixels within figure regions. For each figure, we gather every Azure word whose bounding box falls inside the figure region and combine them as ocr_text. A string like "Multi-Head Attention Concat Linear h" now appears in image_df.ocr_text for the figure on page 4 of the Attention paper. This means retrieval can match a question about "multi-head attention" even when the answer is embedded as text inside a figure.

*fitz returns the bounding box with an empty text cell; Azure's OCR recovers the labels printed inside the figure – Image by author*

1.3. Scanned pages: fitz returns nothing, Azure returns OCR text

Imagine a 30-page native contract with a 10-page scanned amendment appended at the end. Fitz reads the native pages but returns empty strings for the scanned ones. The parser does not flag this issue. The downstream pipeline silently processes only 75% of the document, and the user has no idea that a quarter of the content is missing.

Azure runs OCR on every page regardless of its source. Both native and scanned pages come back through the same result.pages[i].lines path with an identical structure. The parsing_method column in line_df allows downstream code to identify which engine produced which rows. The parsing_summary dictionary includes an n_pages field that reflects the document's actual page count, not just the pages containing native text.
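The silent gap in the 30-plus-10-page example is cheap to detect before any routing decision. The sketch below assumes the per-page strings have already been extracted (with PyMuPDF they would typically come from `page.get_text()`); `find_textless_pages` is a hypothetical helper name, and the page texts are simulated.

```python
def find_textless_pages(page_texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [number for number, text in enumerate(page_texts, start=1) if not text.strip()]


# Simulated fitz output: 30 native pages with text, then 10 scanned pages with none.
page_texts = ["Clause text"] * 30 + [""] * 10

textless = find_textless_pages(page_texts)
coverage = 1 - len(textless) / len(page_texts)

print(textless[0], textless[-1])   # first and last page without a text layer
print(coverage)                    # share of pages that produced text
```

A check like this turns the silent failure into an explicit signal: the pages it lists are the ones worth sending to OCR.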

*A scan is made up of pixels, not characters; fitz has no text layer to read, while Azure applies OCR to the page – Image by author*

1.4. Captions and headings: fitz relies on regex, Azure uses explicit roles

Fitz identifies figure and table captions using regex patterns applied to the beginning of each line (^Figure \d+\b, ^Table \d+\b). This works when captions follow the format "Figure 2" but misses variations like "Fig. 2" and multi-line captions. It also produces false positives: a body-text sentence beginning with "Figure 2" gets flagged as a caption when it is simply a reference.

*The two failure modes of caption-by-regex (a missed "Fig." caption and a body mention incorrectly flagged) that Azure's paragraph role system avoids – Image by author*
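Both failure modes can be reproduced in a few lines. The sketch below merges the two patterns quoted above into one expression for brevity; the sample lines are invented for illustration.

```python
import re

# The two start-of-line patterns from the text, combined into one expression.
CAPTION_RE = re.compile(r"^(Figure|Table) \d+\b")

lines = [
    "Figure 2: Scaled dot-product attention.",   # real caption, matched
    "Fig. 2: Scaled dot-product attention.",     # real caption, missed
    "Figure 2 shows the attention mechanism.",   # body mention, wrongly matched
]

flags = [bool(CAPTION_RE.match(line)) for line in lines]
for line, flagged in zip(lines, flags):
    print(flagged, line)
```

The second line is a missed caption and the third is a false positive, which is exactly the pair of errors a role label avoids.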

Azure's paragraphs field includes role labels: each paragraph in the result carries a tag such as "figureCaption", "tableCaption", "title", or "sectionHeading".

Figure captions and table captions directly fill the object_registry. Titles and section headings, by contrast, feed the TOC reconstruction. In Azure’s layout model, the role tag describes a block's function, which fitz does not provide. The (object_type, object_id) pair used for joining is still extracted from the caption text with the same regex, so linking back through cross_ref_df remains unchanged.

The TOC reconstruction is where things get more involved. fitz’s build_toc_df pulls native bookmarks through doc.get_toc(). If the PDF lacks native bookmarks, fitz returns an empty TOC — which is typical for enterprise documents like Word exports, scanned files, or form-generated PDFs. Azure takes a different route by rebuilding the TOC using paragraph roles: every "title" paragraph becomes a top-level entry, and every "sectionHeading" becomes a second-level entry. The nesting is determined by their sequence in the document. It’s not a perfect solution, but it gives you a working TOC where fitz would give you nothing at all.

2. Same interface, more detailed output

A single function. The same collection of tables produced by parse_pdf, in identical format. One shared Azure call drives every builder. That call is simple: direct the SDK at the document with a single model_id set to prebuilt-layout. (The alternative prebuilt model, prebuilt-read, handles OCR only. The layout model is the one that also delivers tables, paragraph roles, and reading order.)

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

# "Layout" refers to the prebuilt-layout model (not prebuilt-read, which only does OCR)
with open("contract.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=f.read()),
    )

result = poller.result()   # produces tables, paragraph roles, OCR, and reading order

parse_pdf_azure_layout serves as the Azure counterpart to parse_pdf: identical call structure, same set of tables returned, allowing every downstream component to consume it without being aware of which engine was used. The function body is worth examining because it establishes the pattern every engine in this series follows: a single API call, followed by one small builder per table, with engine-agnostic builders reused for tables that only depend on line_df.

import pandas as pd

def parse_pdf_azure_layout(pdf_path):
    result = analyze_pdf(pdf_path)                          # single call to prebuilt-layout
    line_df  = azure_layout_pdf_to_line_df(pdf_path, result=result)
    image_df = build_image_df_azure_layout(result)          # adds ocr_text
    toc_df   = build_toc_df_azure_layout(result)            # built from paragraph roles
    object_registry = build_object_registry_azure_layout(result)  # from role tags
    page_df      = build_page_df(line_df)                   # reused fitz builder (line_df only)
    cross_ref_df = build_cross_ref_df(line_df)              # reused fitz builder (line_df only)
    # parsing_summary is assumed to be assembled at this point; the excerpt does not show that step
    return {"line_df": line_df, "image_df": image_df, "toc_df": toc_df,
            "object_registry": object_registry, "page_df": page_df,
            "cross_ref_df": cross_ref_df, "span_df": pd.DataFrame(),  # span_df stays empty for this engine
            "parsing_summary": parsing_summary}

Walking through it from top to bottom: analyze_pdf triggers one Azure call, then a dedicated builder extracts each table from that shared result. The two tables that rely solely on line_df — page_df and cross_ref_df — are generated by the very same builders used by the native fitz parser. The returned dictionary is the standard contract every engine adheres to.
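One way to make that contract explicit is a small check that any engine's output can be passed through. The key names come from the returned dictionary above; the helper name `missing_contract_keys` is hypothetical, and the check covers keys only, not column schemas.

```python
EXPECTED_KEYS = {
    "line_df", "image_df", "toc_df", "object_registry",
    "page_df", "cross_ref_df", "span_df", "parsing_summary",
}


def missing_contract_keys(parsed):
    """Return the contract keys absent from a parser's output, sorted by name."""
    return sorted(EXPECTED_KEYS - set(parsed))


# A complete result passes; a partial one reports what is missing.
complete = {key: None for key in EXPECTED_KEYS}
partial = {"line_df": None, "page_df": None}

print(missing_contract_keys(complete))
print(missing_contract_keys(partial))
```

Running such a check in tests for both parse_pdf and parse_pdf_azure_layout would catch a drifting engine before downstream stages do.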

*The same table structure mirrors `parse_pdf`, showing per-row differences compared to fitz – Image by author*

3. Improvements in each table

3.1. line_df adds table-cell rows, image OCR, and selection marks

Consider a 4-column "Schedule of Charges" table. In line_df, this becomes 6 rows: the header row, the markdown divider row, and four data rows.

*Each source row maps to a `line_df` row, with column structure preserved inside the markdown text – Image by author*

Cells are kept within line_df rather than being split into a separate table_cells_df. This gives every downstream brick one unified table to work with — paragraph lines and table rows share the same format. The trade-off is that cell-level queries need a markdown-parsing step. For RAG use cases this is acceptable: the retriever matches keywords within the row text, and the LLM reads the markdown content directly.

Text extracted from inside images also appears in line_df as additional rows. Azure’s result.pages[i].lines already captures lines located inside figure regions, so the line builder picks them up naturally. Selection marks (checkboxes) are converted into short marker lines: [x] means selected, [] means unselected. This makes forms with checkbox fields searchable.
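The checkbox mapping is small enough to show in full. The sketch assumes the mark states arrive as the strings "selected" and "unselected", which is how the layout model reports them; the row layout and the function name are hypothetical.

```python
def selection_mark_to_line(state):
    """Map a selection mark state to the marker text stored in line_df."""
    return "[x]" if state == "selected" else "[]"


# Hypothetical marks from one form page: (page number, state).
marks = [(1, "selected"), (1, "unselected"), (1, "selected")]

rows = [
    {"page": page, "text": selection_mark_to_line(state), "parsing_method": "azure_layout"}
    for page, state in marks
]
print([row["text"] for row in rows])
```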

3.2. image_df gains an ocr_text column

The same rows, plus a new column. For each detected figure, every Azure word whose bounding box overlaps the figure region by at least 50% is collected and joined as ocr_text.

*Figures from the Attention paper with their labels exposed; text inside figures is now searchable – Image by author*

On an image_df produced by fitz, that same column stays empty. The fitz parser doesn't perform OCR on images. When the parsing_method is set to "fitz", the ocr_text column is still present for shape consistency but remains blank. Any downstream logic that checks for non-empty ocr_text works identically regardless of whether the data came from fitz or Azure.
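The 50% overlap rule used to fill ocr_text can be sketched as below. Two assumptions are baked in: boxes are simplified to axis-aligned rectangles (x0, y0, x1, y1), whereas Azure reports polygons, and the threshold is measured against the word's own area. The function names and sample coordinates are illustrative only.

```python
def overlap_fraction(word_box, figure_box):
    """Fraction of the word's area that lies inside the figure box."""
    wx0, wy0, wx1, wy1 = word_box
    fx0, fy0, fx1, fy1 = figure_box
    inter_w = max(0.0, min(wx1, fx1) - max(wx0, fx0))
    inter_h = max(0.0, min(wy1, fy1) - max(wy0, fy0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0


def figure_ocr_text(words, figure_box, min_overlap=0.5):
    """Join, in reading order, the words that overlap the figure by at least min_overlap."""
    return " ".join(
        text for text, box in words
        if overlap_fraction(box, figure_box) >= min_overlap
    )


figure_box = (100, 100, 300, 300)
words = [
    ("Multi-Head", (110, 120, 170, 135)),   # fully inside the figure
    ("Attention", (175, 120, 230, 135)),    # fully inside the figure
    ("Linear", (280, 200, 320, 215)),       # exactly half inside
    ("Figure", (100, 320, 140, 335)),       # caption text below the figure
]
ocr_text = figure_ocr_text(words, figure_box)
print(ocr_text)
```

Measuring overlap against the word's area rather than the figure's keeps small labels from being discarded just because the figure is large.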

3.3. toc_df reconstructed from paragraph roles

When native bookmarks exist, fitz's build_toc_df is both precise and cost-free — it simply reads the author's own outline. When bookmarks are absent (as is the case with the majority of enterprise documents), fitz returns an empty toc_df and downstream stages lose all section structure.

The Azure builder iterates through result.paragraphs, keeps only those with a role in {"title", "sectionHeading"}, and assembles a TOC from them. Level 1 corresponds to titles, level 2 to section headings. The nesting follows the order in which paragraphs appear across the document. The output retains the same columns as the fitz-generated TOC: start_page, end_page, start_y, and breadcrumb. A lookback pass determines end_page by finding the start_page of the next peer or ancestor heading.
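A role-based TOC builder along these lines might look like the sketch below. It is not the series' build_toc_df_azure_layout: paragraphs are plain dicts with hypothetical keys (role, content, page, y), the breadcrumb separator " > " is an assumption, and the last section of each level is assumed to run to the final page.

```python
ROLE_LEVEL = {"title": 1, "sectionHeading": 2}


def build_toc(paragraphs, n_pages):
    """Build TOC entries from role-tagged paragraphs, in document order."""
    entries = []
    current_title = None
    for p in paragraphs:
        level = ROLE_LEVEL.get(p.get("role"))
        if level is None:
            continue  # body text and captions do not enter the TOC
        if level == 1:
            current_title = p["content"]
            breadcrumb = p["content"]
        else:
            breadcrumb = f"{current_title} > {p['content']}" if current_title else p["content"]
        entries.append({"level": level, "title": p["content"], "start_page": p["page"],
                        "start_y": p["y"], "breadcrumb": breadcrumb})

    # end_page: start_page of the next heading at the same or a higher level.
    for i, entry in enumerate(entries):
        entry["end_page"] = n_pages
        for later in entries[i + 1:]:
            if later["level"] <= entry["level"]:
                entry["end_page"] = later["start_page"]
                break
    return entries


paragraphs = [
    {"role": "title", "content": "Master Services Agreement", "page": 1, "y": 0.8},
    {"role": None, "content": "This agreement is made between...", "page": 1, "y": 1.5},
    {"role": "sectionHeading", "content": "1. Fees", "page": 2, "y": 1.0},
    {"role": "sectionHeading", "content": "2. Termination", "page": 5, "y": 3.2},
]
toc = build_toc(paragraphs, n_pages=8)
for entry in toc:
    print(entry["level"], entry["breadcrumb"], entry["start_page"], entry["end_page"])
```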

3.4. Object Registry Enhancements for Caption Detection

The fitz parser identifies captions using strict regex patterns at the start of a line (e.g., `^Figure \d+\b`). This fails when the format varies slightly (for example "Fig. 2." instead of "Figure 2"). In contrast, the Azure engine explicitly labels paragraphs as "figureCaption" or "tableCaption". We use these labels directly to improve accuracy, though we still use the same regex to extract the specific ID numbers that link objects across the document.
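The division of labor between role and regex can be illustrated as follows. This is a sketch under assumptions: the role decides whether a paragraph is a caption and of which type, while a number-extracting pattern pulls the ID. The pattern here is an illustrative variant that also accepts "Fig.", not the series' actual regex, and the dict keys are hypothetical.

```python
import re

ROLE_TO_TYPE = {"figureCaption": "figure", "tableCaption": "table"}
# Illustrative pattern: captures the number after "Figure", "Fig." or "Table".
NUMBER_RE = re.compile(r"^(?:Figure|Fig\.|Table)\s+(\d+)\b")


def build_object_registry(paragraphs):
    """Collect captions by role, then extract the object number from the caption text."""
    registry = []
    for p in paragraphs:
        object_type = ROLE_TO_TYPE.get(p.get("role"))
        if object_type is None:
            continue  # body text is skipped even if it starts with "Figure 2"
        match = NUMBER_RE.match(p["content"])
        registry.append({
            "object_type": object_type,
            "object_id": int(match.group(1)) if match else None,
            "caption": p["content"],
        })
    return registry


paragraphs = [
    {"role": "figureCaption", "content": "Fig. 2: Multi-head attention."},
    {"role": None, "content": "Figure 2 shows the attention block."},
    {"role": "tableCaption", "content": "Table 1: Schedule of Charges"},
]
registry = build_object_registry(paragraphs)
print([(r["object_type"], r["object_id"]) for r in registry])
```

Because the role gates the decision, the body sentence never reaches the regex, so the false positive disappears without making the pattern stricter.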

3.5. New Document Statistics

Three new metrics have been added to the document summary to help categorize content:

Table Count: Helps identify data-heavy documents like contracts.
Figure Count: Indicates visual content.
Selection Marks: Detects checkboxes (filled or empty).

These stats allow for better automated routing; for example, a document with zero figures probably doesn't need image-processing logic.
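A routing rule based on those counts could look like the sketch below. The field names n_tables, n_figures and n_selection_marks are hypothetical (the text confirms only n_pages by name), as are the stage labels.

```python
def plan_stages(parsing_summary):
    """Pick downstream stages from the summary counts; text handling always runs."""
    stages = ["text"]
    if parsing_summary.get("n_tables", 0) > 0:
        stages.append("tables")
    if parsing_summary.get("n_figures", 0) > 0:
        stages.append("images")
    if parsing_summary.get("n_selection_marks", 0) > 0:
        stages.append("forms")
    return stages


# A contract with tables but no figures or checkboxes skips image and form logic.
contract_summary = {"n_pages": 40, "n_tables": 3, "n_figures": 0, "n_selection_marks": 0}
stages = plan_stages(contract_summary)
print(stages)
```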

3.6. Stability of Page and Reference Tables

The `page_df` and `cross_ref_df` tables remain structurally identical regardless of which parsing engine is used. However, `span_df` (which captures details like bold or italic text) is currently only supported by the 'fitz' engine and will be empty when using Azure.

4. Engine Provenance via `parsing_method`

Each row in the output is tagged with its source engine (`azure_layout` or `fitz`). This 'provenance' tag is essential for "Adaptive Parsing." Initially, the system tries to parse with 'fitz' because it is faster. If a page looks complex—like a page where a table is detected but no rows were extracted—it can be re-parsed using the more powerful Azure engine. The `parsing_method` column then helps the system decide which version of the data to keep.
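The keep-or-replace decision can be sketched as a per-page merge. The rule assumed here is the simplest one: whenever Azure re-parsed a page, its rows win and the fitz rows for that page are dropped. Rows are plain dicts with hypothetical keys, and `merge_by_page` is an assumed name rather than part of the series' code.

```python
def merge_by_page(fitz_rows, azure_rows):
    """Keep azure_layout rows for every page Azure re-parsed, fitz rows elsewhere."""
    azure_pages = {row["page"] for row in azure_rows}
    kept = [row for row in fitz_rows if row["page"] not in azure_pages]
    # sorted() is stable, so rows keep their original order within a page.
    return sorted(kept + azure_rows, key=lambda row: row["page"])


fitz_rows = [
    {"page": 1, "text": "1. Fees", "parsing_method": "fitz"},
    {"page": 2, "text": "Renewal fee 500 Setup fee 200", "parsing_method": "fitz"},
    {"page": 3, "text": "2. Termination", "parsing_method": "fitz"},
]
azure_rows = [
    {"page": 2, "text": "| Renewal fee | 500 |", "parsing_method": "azure_layout"},
    {"page": 2, "text": "| Setup fee | 200 |", "parsing_method": "azure_layout"},
]

merged = merge_by_page(fitz_rows, azure_rows)
print([(row["page"], row["parsing_method"]) for row in merged])
```

Since every row carries its parsing_method, the merged table still records which engine produced each page.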

5. Operational Costs and Performance

While 'fitz' is practically instant and free, Azure introduces latency (2–4 seconds per page) and cost (around 1 cent per page). For a busy system, it is best to use Azure only for specific pages that 'fitz' couldn't handle perfectly to keep costs down.
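A back-of-the-envelope comparison makes the saving concrete. The sketch uses the figures quoted above (about US$0.01 and 2 to 4 seconds per page) and assumes, for simplicity, that per-page latency adds up linearly; actual pricing and timing depend on the Azure tier and document.

```python
def azure_estimate(n_pages, cost_per_page=0.01, seconds_per_page=(2.0, 4.0)):
    """Rough cost and latency range for sending n_pages through Azure."""
    low, high = seconds_per_page
    return {
        "cost_usd": round(n_pages * cost_per_page, 2),
        "latency_s": (n_pages * low, n_pages * high),
    }


full = azure_estimate(40)      # the whole 40-page contract
partial = azure_estimate(10)   # only the 10 scanned pages fitz could not read

print(full)
print(partial)
```

Routing only the pages that need OCR cuts both numbers by the same factor as the page count.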

1.3x, maintain referential similarity), re-OCR with Azure. The text_quality_score is calculated in pre_parse_signals and retrieved by the dispatcher.

A fourth signal is straightforward. If the document lacks a native TOC (fitz.toc_df.empty) and section context is needed for generation, run the document once through Azure to reconstruct a TOC — one fixed cost per document, not per query.

Article 10 assembles the complete dispatcher. The parsing_method column enables every downstream stage to identify which engine processed each row.

7. Conclusion

Two engines, one unified contract: identical relational tables in return, identical downstream code regardless of which engine executes.

*Every capability that matters for enterprise RAG, plus speed and cost – Image by author*

A parser delivers not text but a structured model of the document. Azure enriches that model (cell-level tables, OCR within figures, captions labeled by role, TOC reconstructed without bookmarks) in 2 to 4 seconds at roughly US$0.01 per page. Fitz incurs no cost and runs in milliseconds. The routing logic is simple: fitz by default, Azure whenever an upstream signal indicates fitz falls short. Article 10 wires up the dispatcher.

8. Sources and further reading

The prebuilt-layout model behind parse_pdf_azure_layout is documented by Microsoft. It builds on cell-level table extraction research (Smock et al. 2022) plus a paragraph-role layer that maps visual regions into structural roles. Docling (Article 5ter) is the open-source counterpart performing the same pipeline locally; it offers the same table contract on local hardware whenever documents cannot leave the premises.

Aligned with the article's direction:

Microsoft, Azure AI Document Intelligence: Layout model. The official documentation for prebuilt-layout, which underpins parse_pdf_azure_layout. The cell-level table output, paragraph roles, and OCR coverage all derive from this foundation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The research powering Azure's cell-level table extraction — essential for understanding what azure_layout does internally.

A different angle, different use case:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). An open-source, locally hosted alternative to the Azure layout pipeline. It offers the same table contract and trades cloud expense for local compute. The right fit when confidentiality policies prevent uploading documents to the cloud.

