This companion piece is part of the Enterprise Document Intelligence series, which walks through building an enterprise RAG system using four core components. Article 5 (document parsing) introduced a parser built with PyMuPDF (fitz). This companion maintains the same objective and the same relational tables, but replaces the engine with Docling—a more powerful library that captures table cells, performs OCR, and extracts captions that fitz overlooks, all while running entirely on your local machine. The significance of that last point is where we begin.
This companion’s place in the series: it builds on Article 5 (document parsing), within Part II (the four core components), using an alternative parsing engine – Image by author
The most capable parser available can interpret tables, scanned pages, and text embedded within figures. But it also requires sending your documents to a third-party cloud service.
For many enterprise use cases, that’s a dealbreaker. Think about the insurance policy on your desk, the patient medical record, the M&A data room, or the signed employment contract. Legal teams won’t permit those files to leave the facility, let alone be transmitted to an external cloud. The most sophisticated parser in the world is worthless if compliance prevents you from uploading the document in the first place.
Docling represents the other side of the solution. It’s an open-source document parser from IBM Research (MIT license, as stated in the project’s LICENSE file on GitHub): it handles layout detection, OCR, reading order, and TableFormer (IBM’s deep-learning model that identifies table structure—rows, columns, and headers—without relying on regex). Everything installs with a simple pip install and runs locally. The first invocation downloads the models to a local cache; all subsequent calls work offline. No API key, no per-page fees, and the document never leaves your machine.
And the output maps to the same relational tables as fitz and Azure. The downstream pipeline doesn’t distinguish which engine generated the dict. Retrieval, generation, and annotation all operate on rows—they never interact with the PDF directly.
Identical tables, with Docling enhancing roughly half of them—all processed locally – Image by author
1. The limitation is the cloud, not the parsing capability
Article 5 bis made the argument for more sophisticated parsing: tables that preserve their column structure, OCR on scanned pages, text extracted from within figures, and headings identified even when the PDF lacks bookmarks. None of that reasoning changes here. What changes is where the processing takes place.
Azure AI Document Intelligence (Azure DI) is a managed cloud service. You submit document bytes, and it returns structured data. For a publicly available arXiv paper, that’s perfectly acceptable. For the documents that populate a real enterprise archive, it often isn’t:
Confidentiality: Insurance policies, medical records, contracts covered by NDAs, or any document containing personal data. Transmitting these to a third-party API constitutes a data-processing event that requires legal approval—and that approval is frequently denied.
Data residency: “The data must remain within this region” is a standard contractual requirement in many industries. A cloud parser operating in the wrong region violates that obligation.
Air-gapped environments: Some networks have zero outbound internet connectivity. A cloud API call isn’t just slow in those settings—it’s completely impossible.
Cost at scale: A few cents per page is negligible for a thousand pages but becomes a significant expense at ten million.
The capability is identical; the distinction is whether the document crosses into a metered cloud service or remains on your own infrastructure – Image by author
Docling addresses all four concerns in the same way: the model executes where the document already resides. The tradeoff shifts from cost and trust to compute resources and initial setup. You invest in CPU time and a one-time model download rather than ongoing per-page charges and compliance reviews. For a sensitive document corpus, that’s the tradeoff you’d prefer.
The remainder of this article follows the same structure as Article 5 bis, because the underlying contract is identical. Where Docling diverges from Azure in specific details, those differences are explicitly noted.
2. Same contract, executed locally
A single call produces the same tables as the fitz parser, in the same format, all from one local Docling conversion. The Docling SDK call itself is straightforward: instantiate a DocumentConverter, pass it a file path, and retrieve a DoclingDocument. The first invocation downloads the layout and TableFormer model weights to a local cache; every subsequent call operates offline.
from docling.document_converter import DocumentConverter
converter = DocumentConverter() # lazy: no model loaded yet
result = converter.convert("data/paper/1706.03762v7.pdf")
doc = result.document # a DoclingDocument
# what a DoclingDocument provides
doc.export_to_markdown() # entire document as markdown
doc.tables # list of TableItem (each includes .data.table_cells)
doc.pictures # list of PictureItem (bbox + optional ocr / classification)
doc.texts # list of TextItem, tagged as title / section_header / paragraph / formula / caption
That DoclingDocument is what every builder function in this article processes. parse_pdf_docling wraps the call above and transforms the document into the same dict of tables that every other engine produces, so downstream components consume the output without awareness of which engine was used. Here’s how you invoke the wrapper.
out = parse_pdf_docling("data/contracts/MyContract.pdf")
out["line_df"] # text items + table cells + checkboxes
out["page_df"] # one row per page
out["image_df"] # pictures, ocr_text + classification
out["toc_df"] # reconstructed from layout labels
out["object_registry"] # captions identified by label
out["cross_ref_df"] # body-text references (regex)
out["span_df"] # empty (no sub-line typography)
out["parsing_summary"] # doc-level summary dict
parse_pdf_docling is the local counterpart of parse_pdf: identical invocation pattern, identical dict of tables as output, so every downstream component reads it without knowing which engine was responsible. The implementation is worth examining, because it illustrates the pattern every engine in the series follows: convert once, then one small builder function per table, and reuse the engine-agnostic builders for tables that only require line_df.
Reading it from top to bottom: a single convert_pdf call runs the models once, followed by one lightweight builder per table (each consuming that shared doc). The two tables that depend solely on line_df — page_df and cross_ref_df — are generated by the exact same fitz builders used by the native parser. The dictionary returned at the end represents the standard contract that every parsing engine must fulfill.
The same tables mirror parse_pdf, with the real shapes from a Docling run on the 15-page Attention paper – Image by author
What that single conversion actually does. It's easy to lump Docling in with "OCR tools," but that's misleading — OCR is just one optional stage within the pipeline. A convert() call first runs a layout model (which identifies regions such as tables, figures, headings, and body text, along with their reading order), then applies TableFormer to every detected table (extracting the grid of rows, columns, and headers), and only invokes an OCR engine if the page is a scanned image with no text layer. For born-digital PDFs, the OCR stage is skipped entirely: cell text is pulled directly from the native text layer. This means a table's markdown output reflects TableFormer's structural understanding, populated with cell text that — on a native PDF — was never processed by OCR at all. The OCR engine you choose (EasyOCR, PaddleOCR, Tesseract, or RapidOCR) only affects how scanned pixels are interpreted, while the quality of table extraction itself is governed by TableFormer's mode (fast vs. accurate), not the OCR backend.
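The two settings mentioned above (whether OCR runs, and which TableFormer mode is used) can be set when the converter is constructed. The following is a minimal sketch, assuming the option names documented by Docling at the time of writing (PdfPipelineOptions, TableFormerMode, PdfFormatOption); build_converter is a hypothetical helper name, and field names may differ between Docling versions.

```python
def build_converter(use_ocr: bool = False, accurate_tables: bool = True):
    """Return a DocumentConverter with explicit OCR and TableFormer settings.

    Hypothetical helper: option names are assumed from the Docling docs
    and may change between releases.
    """
    # Local imports, so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TableFormerMode,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = use_ocr             # only relevant for scanned pages
    options.do_table_structure = True    # run TableFormer on detected tables
    options.table_structure_options.mode = (
        TableFormerMode.ACCURATE if accurate_tables else TableFormerMode.FAST
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options)
        }
    )
```

For a born-digital corpus, build_converter(use_ocr=False) would skip the OCR stage while keeping table structure extraction in its accurate mode.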
Docling is a pipeline, not an OCR wrapper: layout analysis and TableFormer handle structure; OCR only reads scanned pixels – Image by author
3. What each table gains
To make this concrete and verifiable, we ran Docling on the Attention Is All You Need paper (Vaswani et al., 2017; arXiv non-exclusive distribution license, as stated on the arXiv abstract page) — the same public arXiv PDF used throughout this series. It's 15 pages, born-digital, with no native bookmarks, four real tables, six figures, and five display equations. A document where fitz already handles the prose well but struggles with tables and section structure. You can swap in your own PDF and the same builders will run; the figures below reflect what Docling produced on this particular file.
3.1. line_df gains table-cell rows, figure text, and checkboxes
Docling's TableFormer model identifies each table as a grid of cells, complete with row/column indices and header flags. We flatten that grid into markdown rows so the table is embedded directly within line_df like any other content: one line per table row, with a | --- | separator after the header. Table 1 of the paper (a 5-row, 4-column complexity comparison, counting the header row) produces six line_df rows: the header, the | --- | separator, and four data rows.
The flattening logic is compact, and worth examining because it's the core technique: create an empty rows × cols grid, place each TableFormer cell into its (row, column) position, then join each row into a single markdown line.
def table_to_markdown_rows(table):
    n_rows, n_cols = table.data.num_rows, table.data.num_cols
    grid = [[""] * n_cols for _ in range(n_rows)]
    header = set()
    for cell in table.data.table_cells:  # the cells TableFormer found
        row, col = cell.start_row_offset_idx, cell.start_col_offset_idx
        grid[row][col] = cell.text.strip()
        if cell.column_header:
            header.add(row)
    h = min(header) if header else 0
    rows = ["| " + " | ".join(grid[h]) + " |",           # header row
            "| " + " | ".join(["---"] * n_cols) + " |"]  # separator
    rows += ["| " + " | ".join(grid[r]) + " |"           # data rows
             for r in range(n_rows) if r != h]
    return rows  # one markdown line per source row → one line_df row each
Each source row becomes a line_df row; the column structure is preserved within the markdown text. These are the actual rows Docling produced for Table 1 – Image by author
We keep table cells inside line_df rather than introducing a separate table-cells table. One DataFrame serves every downstream component — paragraph lines and table rows have identical output formats. The trade-off: per-cell queries require a markdown parsing step. For RAG workloads this is perfectly acceptable. The retriever matches keywords against the row text, and the model reads the markdown directly. This mirrors the same design decision made by the Azure builder, ensuring that downstream chunkers handle fitz, Azure, and Docling table rows uniformly.
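The markdown parsing step that per-cell queries require is small. Below is a minimal sketch, assuming rows shaped like the flattened output described above (header line, | --- | separator, data lines); the function names are hypothetical, and cell text that itself contains a pipe character is not handled.

```python
def parse_markdown_row(line: str) -> list[str]:
    """Split one markdown table line into its cell texts."""
    stripped = line.strip()
    if not (stripped.startswith("|") and stripped.endswith("|")):
        raise ValueError(f"not a markdown table row: {line!r}")
    # Drop the outer pipes, then split on the inner ones.
    return [cell.strip() for cell in stripped[1:-1].split("|")]


def is_separator(cells: list[str]) -> bool:
    """True for the '| --- | --- |' line that follows the header."""
    return bool(cells) and all(c and set(c) <= {"-", ":"} for c in cells)


def rows_to_records(lines: list[str]) -> list[dict[str, str]]:
    """Turn a flattened table (header, separator, data rows) into dicts."""
    parsed = [parse_markdown_row(line) for line in lines]
    header = parsed[0]
    body = [row for row in parsed[1:] if not is_separator(row)]
    return [dict(zip(header, row)) for row in body]


# Illustrative rows, not taken from the paper.
example = ["| Item | Qty |", "| --- | --- |", "| bolt | 4 |", "| nut | 8 |"]
records = rows_to_records(example)
```

Here records[0]["Qty"] gives the cell under the Qty header for the first data row, which is the kind of lookup that plain row text does not offer directly.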
Two additional sources contribute to line_df. Text that Docling detects inside a figure region is captured as ordinary text rows (recovered through layout analysis + OCR), so labels rendered within a diagram become searchable. Checkbox items are represented as short marker lines, [x] for selected and [ ] for unselected, making form fields queryable. On the Attention paper, the line count grows from the raw prose baseline to 560 rows once the four tables are flattened in.
3.2. image_df gains ocr_text and a classification column
Same rows, two new columns. For each detected image, we gather every text item whose bounding box overlaps the figure region by at least 50% and concatenate them as ocr_text. The architecture diagram on page 3 and the two attention diagrams on page 4 contain labels embedded within the figure; those labels appear in ocr_text and become retrievable.
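The 50% overlap rule can be sketched with plain rectangles. This is an illustration under stated assumptions, not the series' builder: boxes use a top-left origin (top < bottom), the overlap is measured as the share of the text item's area that falls inside the figure box, and the names Box, overlap_fraction, and collect_ocr_text are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    """Rectangle area; zero when the box is empty or inverted."""
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def overlap_fraction(text_box: Box, figure_box: Box) -> float:
    """Share of the text box's area that lies inside the figure box."""
    inter = Box(
        max(text_box.left, figure_box.left),
        max(text_box.top, figure_box.top),
        min(text_box.right, figure_box.right),
        min(text_box.bottom, figure_box.bottom),
    )
    text_area = area(text_box)
    return area(inter) / text_area if text_area > 0 else 0.0


def collect_ocr_text(figure_box, text_items, threshold=0.5):
    """Join the text of every item overlapping the figure by >= threshold."""
    hits = [
        text
        for text, box in text_items
        if overlap_fraction(box, figure_box) >= threshold
    ]
    return " ".join(hits)


figure = Box(0, 0, 100, 100)
items = [
    ("Encoder", Box(10, 10, 40, 20)),        # fully inside the figure
    ("body text", Box(90, 90, 130, 110)),    # only a corner overlaps
    ("elsewhere", Box(200, 200, 220, 210)),  # no overlap at all
]
ocr_text = collect_ocr_text(figure, items)
```

Measuring against the text item's area (rather than the figure's) means a small label sitting fully inside a large diagram still counts as a match.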
The Attention paper's figures with their internal labels exposed – Image by author
The second new column is classification. Docling includes an optional image classifier that tags each figure (by chart type, logo category, and so on). When the classifier is enabled, the tag populates classification; when it isn't, the column remains present for schema compatibility but stays empty. Azure has no equivalent feature, so this is one area where Docling goes further.
Neither column exists in a fitz-generated image_df: fitz provides width_px, height_px, and image_hash, and it never performs OCR on images.
3.3. Reconstructing toc_df from layout labels
The Attention paper lacks built-in bookmarks. If you run fitz's build_toc_df on it, you'll get an empty table. This is a typical scenario in enterprise settings—Word exports, scanned documents, or anything not created in LaTeX with hyperref configured. As a result, the generation process loses the section structure.
Docling directly labels every heading:
A title item for the document title, when detected
A section_header item for each section heading (in this paper, Docling even tagged the title as a section heading)
The builder processes both labels, assigns a level, and constructs a TOC using the same start_page, end_page, start_y, and breadcrumb columns as the fitz approach. The lookback pass that calculates end_page is identical to the fitz and Azure methods; only the source of the rows differs.
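The shape of that builder can be sketched without any parsing engine. This is a minimal illustration, assuming headings arrive in reading order as dicts with title, level, start_page, and start_y, and assuming a section ends on the page where the next heading at the same or a shallower level starts; the exact end_page rule used in the series may differ, and build_toc is a hypothetical name.

```python
def build_toc(headings, last_page):
    """Add breadcrumb and end_page to headings given in reading order."""
    toc, trail = [], []  # trail: stack of (level, title) ancestors
    for heading in headings:
        # Pop siblings and deeper entries so the trail holds only ancestors.
        while trail and trail[-1][0] >= heading["level"]:
            trail.pop()
        trail.append((heading["level"], heading["title"]))
        breadcrumb = " > ".join(title for _, title in trail)
        toc.append({**heading, "breadcrumb": breadcrumb})

    # Lookback pass: walk backwards, remembering for each level the start
    # page of the nearest later heading at that level.
    next_start = {}
    for entry in reversed(toc):
        later = [
            page for level, page in next_start.items()
            if level <= entry["level"]
        ]
        entry["end_page"] = min(later) if later else last_page
        next_start[entry["level"]] = entry["start_page"]
    return toc


# Illustrative headings, not the ones recovered from the paper.
headings = [
    {"title": "1 Introduction", "level": 1, "start_page": 1, "start_y": 400.0},
    {"title": "2 Method", "level": 1, "start_page": 2, "start_y": 120.0},
    {"title": "2.1 Setup", "level": 2, "start_page": 2, "start_y": 300.0},
    {"title": "2.2 Details", "level": 2, "start_page": 3, "start_y": 90.0},
    {"title": "3 Results", "level": 1, "start_page": 5, "start_y": 60.0},
]
toc = build_toc(headings, last_page=6)
```

When every heading carries the same level, the breadcrumb collapses to the heading's own title and the result is a flat section index.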
On the Attention paper, Docling recovers 28 headings while fitz recovers none. This number isn't inflated: Docling tagged each section heading once, including sub-sections under Method and Results, which is accurate for this paper. On a document with fewer, longer sections, the count would be lower.
28 headings recovered from layout labels on a PDF with no native bookmarks – Image by author
The depth of the hierarchy depends on the document. When Docling's layout model assigns distinct heading levels, you get a genuine multi-level tree. On this paper, it labeled all headings at a single level, so the reconstructed TOC is mostly flat. Either way, it provides a usable section index where fitz would return nothing. Azure achieves the same result using its own role tags; the two are comparable, with Docling running locally.
3.4. Caption detection in object_registry
Fitz identifies captions using regex patterns anchored to the start of a line, such as ^Figure \d+\b or ^Table \d+\b. This misses formats like Fig. 2 and multi-line wraps, and it can produce false positives on body sentences starting with "Figure 2 shows…".
During layout analysis, Docling assigns a caption label to caption blocks. We read this label directly—no regex needed to locate the caption. The (object_type, object_id) join key into cross_ref_df is still extracted from the caption text using the same regex that fitz and Azure builders use, ensuring consistent joins across engines. On the Attention paper, this captures all nine captions (Figures 1 through 5, Tables 1 through 4) in object_registry. The advantage is recall: Docling catches captions that fitz's line-start regex would miss.
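Once the caption block is known from its label, extracting the join key is a small regex step. The pattern below is a stand-in written for illustration, not the regex the series' builders share; the name caption_key and the choice to return object_id as an integer are assumptions.

```python
import re

# Stand-in pattern: accepts "Figure 3", "Fig. 3", "Table 2", "Tab. 2".
CAPTION_KEY = re.compile(r"^\s*(Figure|Fig\.|Table|Tab\.)\s*(\d+)\b", re.IGNORECASE)


def caption_key(caption_text: str):
    """Return (object_type, object_id) for a caption, or None if no match."""
    match = CAPTION_KEY.match(caption_text)
    if match is None:
        return None
    kind = match.group(1).lower()
    object_type = "figure" if kind.startswith("fig") else "table"
    return object_type, int(match.group(2))
```

Because the layout label has already decided that the block is a caption, the regex only has to read the number, so a body sentence that happens to start with "Figure 2 shows" never reaches it.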
3.5. Docling-specific stats in parsing_summary
Three metrics are added to the document-level summary dictionary:
n_tables_detected: the number of tables TableFormer identified (4 on the Attention paper).
n_pictures: the number of figures the layout model detected (6).
n_formulas: the number of display equations Docling tagged as formula (5).
These metrics simplify routing. A document with n_tables_detected = 18 likely resembles a contract where table structure is critical. A document with n_formulas in the dozens is math-heavy and may benefit from a formula-aware downstream process. A document with n_pictures = 0 is text-only, so there's no reason to scan figures for inline references.
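As a rough illustration of that routing, the three counts can be turned into hints with a few comparisons. The thresholds and hint names here are hypothetical placeholders chosen for the example, not values from the series.

```python
def routing_hints(summary, table_heavy=10, formula_heavy=24):
    """Map parsing_summary counts to coarse routing hints.

    Thresholds are illustrative assumptions; tune them on your own corpus.
    A missing count is treated as zero.
    """
    hints = []
    if summary.get("n_tables_detected", 0) >= table_heavy:
        hints.append("table_structure_critical")
    if summary.get("n_formulas", 0) >= formula_heavy:
        hints.append("formula_aware_processing")
    if summary.get("n_pictures", 0) == 0:
        hints.append("skip_figure_scan")
    return hints
```

With the counts reported above for the Attention paper (4 tables, 6 pictures, 5 formulas), none of these placeholder thresholds is crossed and the list comes back empty.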
3.6. page_df and cross_ref_df: no changes
Two tables remain unchanged in structure. page_df and cross_ref_df are built solely from line_df, so the engine that produced line_df doesn't matter. One implementation works across three engines without any drift.
Under Docling, span_df is empty, just as it is under Azure. The layout model doesn't expose sub-line typography (such as per-word bold or italic). When you need spans for heading detection or term emphasis, stick with fitz for that document. The engines complement each other.
4. The parsing_method column: tracking provenance for adaptive parsing
Every per-row table from parse_pdf_docling includes parsing_method == "docling". Tables from parse_pdf carry "fitz", and those from parse_pdf_azure_layout carry "azure_layout". Same column, same name, across all engines. The column exists for downstream consumers, which can tell from each row which engine produced it.
Contract parsed with fitz, the table page re-parsed with Docling; both engines coexist in line_df via the parsing_method column – Image by author
This is what adaptive parsing (Article 10) relies on. The default pass uses fitz. Pages that fail a pre-parse check—such as a table region with no extracted rows, an image-heavy page with sparse text, or an OCR layer with low quality—get re-parsed by a more powerful engine. With Docling, that re-parse runs locally, so it remains available even when the document can't be sent to the cloud. The re-parsed rows replace or append to the original line_df rows, and the parsing_method column maintains the audit trail.
Three downstream patterns enabled by this column:
De-duplication: when the same page was processed twice, prefer the heavier engine's rows over fitz's using an explicit precedence map.
Audit: a row with parsing_method == "docling" indicates that a model, not plain text extraction, produced it—useful for confidence weighting in answers.
Routing accounting: tracking which pages required the heavy path and how long they took.
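The de-duplication pattern can be sketched with a precedence map over parsing_method. This is an illustration under assumptions: rows are plain dicts with a page key (a hypothetical column name), and the relative order of the two heavier engines is an arbitrary choice made for the example.

```python
# Higher number wins. Ranking docling below azure_layout is arbitrary here.
PRECEDENCE = {"fitz": 0, "docling": 1, "azure_layout": 2}


def dedupe_pages(rows):
    """Keep, for each page, only rows from the highest-ranked engine present."""
    best = {}  # page -> highest precedence seen on that page
    for row in rows:
        rank = PRECEDENCE[row["parsing_method"]]
        best[row["page"]] = max(best.get(row["page"], rank), rank)
    return [
        row for row in rows
        if PRECEDENCE[row["parsing_method"]] == best[row["page"]]
    ]


rows = [
    {"page": 1, "parsing_method": "fitz", "text": "Intro paragraph"},
    {"page": 2, "parsing_method": "fitz", "text": "garbled table text"},
    {"page": 2, "parsing_method": "docling", "text": "| A | B |"},
    {"page": 2, "parsing_method": "docling", "text": "| --- | --- |"},
    {"page": 2, "parsing_method": "docling", "text": "| 1 | 2 |"},
]
kept = dedupe_pages(rows)
```

Pages that were only parsed once keep their fitz rows untouched, and the parsing_method value on every surviving row still records where it came from.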
5. Cost, latency, and setup
Docling is free to license, but not free to operate. Three factors matter.
Latency: On CPU, processing a single page through the full Docling pipeline (layout + TableFormer + OCR) takes roughly 1 to 5 seconds depending on page complexity. The 15-page Attention paper, with OCR enabled, parsed in well under two minutes on a laptop CPU. A GPU reduces this significantly. Fitz parses the same document in under a second. So the routing rule mirrors Azure's approach: parse with fitz first, and escalate to Docling only on pages fitz handled poorly. The difference from Azure is that the escalation costs CPU time, not money or a network round trip.
Setup: The initial conversion fetches the layout and TableFormer models (hundreds of megabytes) to a local cache, and the docling installation includes PyTorch, which is substantial. Plan for the disk space and the one-time download. After that, it operates entirely offline. In an air-gapped setup, you pre-load the model cache; there is no runtime connection to external services.
Compute, not per-page charges: There are no per-page fees. The expense is the hardware you run it on. For ten million pages annually on sensitive data, owning the compute is typically more cost-effective than a per-page cloud subscription, and it is the only viable option when the data cannot leave the premises at all.
These figures shift with hardware and Docling versions. The pattern is what counts: fitz is virtually free and instantaneous, Docling requires seconds of local compute and a one-time setup, Azure charges cents per page and involves a network hop to a cloud you must trust.
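To put the orders of magnitude side by side, the arithmetic can be written out. The 1 to 5 seconds per page and the ten million pages come from the text above; the per-page price argument is a placeholder to fill in from your own contract, not a quoted rate, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def cpu_days(pages: int, seconds_per_page: float) -> float:
    """Sequential single-worker compute time, expressed in days."""
    return pages * seconds_per_page / SECONDS_PER_DAY


def cloud_cost(pages: int, price_per_page: float) -> float:
    """Metered cost for the same volume at a given per-page price."""
    return pages * price_per_page


pages = 10_000_000
low = cpu_days(pages, 1)   # about 115.7 CPU-days at 1 s per page
high = cpu_days(pages, 5)  # about 578.7 CPU-days at 5 s per page
```

Dividing those CPU-days by the number of parallel workers gives a rough wall-clock estimate, which can then be weighed against cloud_cost(pages, price_per_page) for whatever price applies to you.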
6. When to use which
Start with fitz by default. Move to a heavier engine only when a specific signal indicates fitz falls short, and choose the engine based on where the document is permitted to go.
fitz: every parse, as the default. Born-digital PDFs with selectable text and straightforward layout. Free, instant, offline.
Docling: when fitz falls short (tables, scanned pages, figure text, missing bookmarks) and the document is confidential or the environment is air-gapped. Local, free to run, nothing leaves the machine. Also the right default when you prefer owning compute over paying per page.
Azure DI: when fitz falls short and sending the document to a cloud is acceptable, and you prefer a managed service over running models yourself. Per-page cost, zero infrastructure to maintain, fastest to integrate.
The signals that trigger escalation are the same ones Article 5 bis outlined: a detected table region with no row-like structure, an image-heavy page with sparse text, a low OCR-quality score, or a document with no native table of contents where generation requires section context. Article 10 builds the dispatcher that reads those signals. The parsing_method column is what allows every downstream stage to know which engine processed which row.
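The decision rule in this section fits in a few lines. This is only an illustration of the rule as stated here, not the Article 10 dispatcher; PageSignals, choose_engine, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Escalation signals for one page (hypothetical field names)."""
    table_without_rows: bool = False
    image_heavy_sparse_text: bool = False
    low_ocr_quality: bool = False
    missing_toc_needed: bool = False


def choose_engine(signals: PageSignals, cloud_allowed: bool,
                  prefer_managed: bool = False) -> str:
    """Pick a parsing_method value: fitz by default, heavier only on a signal."""
    needs_escalation = any((
        signals.table_without_rows,
        signals.image_heavy_sparse_text,
        signals.low_ocr_quality,
        signals.missing_toc_needed,
    ))
    if not needs_escalation:
        return "fitz"
    # The cloud engine is only eligible when the document may leave the
    # machine AND the team prefers a managed service.
    if cloud_allowed and prefer_managed:
        return "azure_layout"
    return "docling"
```

A confidential contract page with a broken table would resolve to "docling" regardless of the managed-service preference, because cloud_allowed is False for that document.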
7. Conclusion
The relational tables are the same regardless of which engine populates them. Capability rows are nearly even between Azure and Docling; the deciding rows are operational. Azure sends the document to a cloud and charges per page. Docling keeps the document on the machine and charges nothing but compute. Fitz also stays local and costs next to nothing, but it lacks the table, OCR, and heading recovery the other two provide.
Every capability that matters for enterprise RAG, plus where the computation runs, speed, and cost – Image by author
8. Sources and further reading
Docling is documented in the IBM Research technical report (Auer et al. 2024), which describes the layout pipeline, the TableFormer cell-detection model, and the reading-order step. The cell-level table extraction Docling inherits has its own research lineage (Smock et al. 2022, PubTables-1M / Table Transformer). The relevant companion reading for this article is Article 5 bis (Azure DI), which delivers the same table contract from a paid cloud service: same capability, different operational profile.
Same direction as the article:
Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Reference architecture for the local layout pipeline this article uses: layout detection, TableFormer, reading-order, unified document representation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The research lineage behind the cell-level table extraction both Docling and Azure provide.
Different angle, different context:
Microsoft, Azure AI Document Intelligence. Layout model. Paid cloud equivalent of the same cascade (Article 5 bis). Same table contract; trades local compute for cloud upload and per-page cost. The right choice when the operations team prefers a managed service over hosting model weights locally.
The question is rarely "which parser is best"; it is "what is this document allowed to do, and what does this page need." A born-digital page with clean prose: fitz. A table page in a public report: Azure if you want it managed, Docling if you want it local. Article 10 wires the dispatcher that makes the call per page.
Earlier in the series:
Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren't Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren't Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document's nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs (link to come). The second half of the parsing brick: the relational tables every downstream brick reads.
When PyMuPDF can't see the table: parse PDFs for RAG with Azure Layout (link to come). The same tables from Azure Layout: native table cells, OCR, paragraph roles.