Unlocking The Hidden Connections: Why Relational Shape RAG Demands More Than Flat Text From PDFs

This is the fifth installment in the Enterprise Document Intelligence series, which walks through the construction of an enterprise-grade RAG system built on four foundational components: parsing, question parsing, retrieval, and generation. Parsing is the first step, and this article is the second half of that topic. The earlier piece explained how a PDF is transformed into line_df, a structured table with one row for each line of text on a page. This article picks up from there, covering the complete set of tables a parser should produce, what information each one contains, and how they connect to one another — ensuring, for example, that a table on page 14 retains its column layout and that a renewal fee remains correctly paired with its label. The remaining three components, along with the highlighted answer delivered at the end, all draw from these tables rather than the raw PDF.

This article’s place in the series: Article 5, covering the data-model half of the parsing component, within Part II (the four foundational components) – Image by author

Most RAG tutorials begin with the same approach: text = extract_text(pdf). That single line is where PDF-related issues start.

You set up a RAG pipeline. It performs well on a handful of clean documents. Then a client sends over a real-world contract: 30 pages long, with a Schedule of Charges table on page 14. A user asks “what’s the renewal fee?” and the model returns an incorrect figure.

The team concludes: “the model can’t read tables.”

The model handles tables just fine. The issue lies earlier in the pipeline. Your parser processed the table cell by cell and stitched everything into a single long string. The column organization was lost. The connection between a label and its corresponding value was lost. Your model is left to guess which number represents the renewal fee. Occasionally it guesses correctly. More often, it doesn’t.

*The same four rows, merged cell by cell into one chunk. EUR 200 one-time, Late payment 75: which value belongs to which label? – Image by author*

The parser didn’t malfunction. It delivered exactly what was requested. You requested the wrong output.

A well-designed PDF parser doesn’t simply pull out text. It represents the document as a relational collection of tables. One PDF goes in; one table per type of content comes out (seven or eight for now, with more added as new requirements emerge).

toc_df: the document’s sections, organized as the author intended.
page_df and line_df: the main body of text. Every page. Every line.
image_df: every figure found on every page.
span_df: formatting details such as bold, italic, color, and font size. Every span within every line.
object_registry: every figure caption, table caption, and annex reference.
cross_ref_df: every “see Figure 2”, “see Table 4”, and “see Annex B” reference.
parsing_summary: indicates whether the PDF is born-digital, scanned, or a mix, and whether the OCR quality is good or poor.

Retrieval draws from these tables. Generation draws from these tables. Highlighting draws from these tables. You open the PDF a single time. From that point forward, you work exclusively with tables.

This article examines each table in depth, then runs parse_pdf on two very different PDFs side by side to demonstrate that the same set of columns handles both. The previous article (“Beyond extract_text: the two layers of a PDF that drive RAG quality”) covers the upstream side: the explicit signals the parser reads first and the page-level classification it performs before any line is numbered.

How each table is generated: line_df, parsing_summary, toc_df, and image_df are produced directly by the parser; page_df, span_df, object_registry, and cross_ref_df are derived from line_df – Image by author

1. One table per entity

Everything extracted from the document is returned as a dictionary of tables along with a parsing summary, with one table representing each entity in the document model.

The _df naming convention makes the level of granularity clear just from the name. The diagram at the top of this article illustrates how each table is generated. Four are produced directly by the parser: line_df (the text lines), parsing_summary (the document-level overview), toc_df (the native outline, obtained via doc.get_toc), and image_df (via page.get_image_info). The remaining four are derived from line_df: page_df aggregates it by page, while span_df, object_registry, and cross_ref_df are extracted from its lines. How these tables join with one another is addressed separately in section 2.

1.1. toc_df: table of contents

Tables of contents are a staple of enterprise documents. Contracts, reports, policies, employee handbooks, and regulatory filings almost always include a defined section hierarchy, and that hierarchy is the most straightforward semantic signal you can pass to a retriever.

The complication: it isn’t always native. Sometimes it exists only as typographic styling (bold headings, numbered sections, indented subheadings) and must be rebuilt from line_df and span_df.

Here we focus on the native case (typical for born-digital exports from LaTeX, Word, and InDesign); reconstructing a TOC from typographic cues when bookmarks are missing is a topic of its own, outlined by an adaptive parser and explored fully in a dedicated follow-up article.

*Declared outline with `parent_idx` and `breadcrumb`; empty when no native bookmarks exist – Image by author*

How to build it: build_toc_df(doc) calls doc.get_toc(simple=False) (producing one entry per bookmark, with the destination dictionary attached) and iterates through the result to compute parent_idx, breadcrumb, end_page, and start_y. When run on the Attention paper, it yields the 22 entries shown above: three levels of headings, native bookmarks, no reconstruction required.

The implicit end_page convention: TOCs indicate where sections begin but almost never where they end. build_toc_df materializes the end as a column regardless: for each row, end_page is set to the start_page of the next entry at the same level or a shallower one (the next peer or ancestor), with total_pages as the fallback for the last section. Check the Conclusion in the Attention paper: start_page=10, end_page=15. Since the document is only 15 pages long, the final section extends to the very end. By convention, there is a one-page overlap between sections (a section’s end_page matches the next section’s start_page, rather than being successor.start_page - 1). This design choice means the generation brick’s next-page peek, a completeness check that catches truncated lists at section boundaries, requires just a single lookup instead of a runtime scan.
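The convention above can be sketched in a few lines. This is a minimal illustration rather than the actual build_toc_df: it assumes the outline has already been reduced to (level, title, start_page) tuples, and the helper name build_toc_rows and the sample outline are hypothetical.

```python
def build_toc_rows(outline, total_pages):
    """outline: (level, title, start_page) tuples in document order (assumed shape)."""
    rows = []
    stack = []  # indices of the currently open ancestors
    for idx, (level, title, start_page) in enumerate(outline):
        # Close every open entry that is at the same depth or deeper.
        while stack and rows[stack[-1]]["level"] >= level:
            stack.pop()
        parent_idx = stack[-1] if stack else None
        crumbs = [rows[i]["title"] for i in stack] + [title]
        rows.append({
            "level": level,
            "title": title,
            "start_page": start_page,
            "parent_idx": parent_idx,
            "breadcrumb": " > ".join(crumbs),
        })
        stack.append(idx)

    for idx, row in enumerate(rows):
        # end_page = start_page of the next peer or ancestor;
        # the last section falls back to total_pages.
        row["end_page"] = next(
            (r["start_page"] for r in rows[idx + 1:] if r["level"] <= row["level"]),
            total_pages,
        )
    return rows


# Illustrative outline, not the paper's full 22 entries.
toc_rows = build_toc_rows(
    [
        (1, "1 Introduction", 2),
        (1, "3 Model Architecture", 2),
        (2, "3.5 Positional Encoding", 6),
        (1, "7 Conclusion", 10),
    ],
    total_pages=15,
)
```

The second loop is where the one-page overlap comes from: a section ends on the page where its successor starts, not on the page before it.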

About the start_y column: Each bookmark in a PDF outline includes a destination Point(x, y) on its target page, not merely a page number. The build_toc_df function exposes this y-coordinate as start_y (the raw value returned by fitz). It anchors each section header to a precise vertical position within start_page, enabling line-level resolution: the same (target_page, target_y) → line join used for native links described in section 1.6. Note the coordinate-orientation caveat: a value of 720 in the Attention paper (LaTeX, bottom-up origin) and 72 in NIST CSF (Acrobat, top-down origin) both refer to the top of the page, just measured from opposite ends. We store the raw value; callers normalize it when they need to land on a specific line.

start_page and end_page serve as page-level anchors. Line-level anchors (start_line, end_line) are the natural next step: they allow downstream stages to pinpoint a section to the exact line in line_df, and they enable TOC offset detection when front matter has been inserted after the TOC was generated (causing the entire TOC to shift by 1 or 2 pages — a real-world failure scenario). A full exploration of this topic is covered in a dedicated bonus article on TOC anchoring and validation; for now, toc_df remains at page-level granularity (with start_y included as a bonus column for callers ready to resolve down to a line).

Its role: toc_df is the most cost-effective semantic signal in the whole pipeline. Each entry names a section: knowing that lines 100–150 belong to “3.5 Positional Encoding” tells the retriever and the LLM what those lines are about before any embedding is computed. Embeddings provide topical proximity; the TOC provides the document’s own structural meaning for each region — declared by the author, not inferred. The breadcrumb extends this with hierarchical context: a chunk is labeled with “Methods > 3.5 Positional Encoding”, giving the language model section-level grounding without bloating the chunk text. end_page is what allows the generation brick to peek one page beyond a retrieved section and detect truncated answers without needing a vision pass. When the document has a native TOC, all of this comes at no extra cost.

Watch out: TOC entries can reference pages that don’t exist (due to a corrupt or truncated export). Always validate 0 <= page_num < n_pages before recording a row, or a section anchor will point nowhere and the page-range join from section 2 will silently return empty results.
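A guard of this kind might look as follows. It is a sketch that assumes each entry already carries a 0-based page_num; the function name valid_toc_entries is hypothetical.

```python
def valid_toc_entries(entries, n_pages):
    """Split entries into (kept, dropped) using the check 0 <= page_num < n_pages."""
    kept, dropped = [], []
    for entry in entries:
        if 0 <= entry["page_num"] < n_pages:
            kept.append(entry)
        else:
            dropped.append(entry)
    return kept, dropped


kept, dropped = valid_toc_entries(
    [
        {"title": "1 Introduction", "page_num": 0},
        {"title": "7 Conclusion", "page_num": 14},
        {"title": "Annex Z", "page_num": 15},   # one past the last page
        {"title": "Broken link", "page_num": -1},
    ],
    n_pages=15,
)
```

Returning the dropped entries, rather than discarding them silently, leaves room to log them or count them in parsing_summary.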

1.2. line_df: line granularity

The source of truth for text content. Every line in the PDF, along with its position and dominant typographic style.

*one row per text line with bbox, typography, render mode, `column_position` – Image by author*

How to build it: fitz_pdf_to_line_df(pdf_path) iterates through every text block on every page and emits one row per line. assign_column_positions(line_df) then tags each row with single / left / right / multi. Run it on data/paper/1706.03762v7.pdf, the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, as stated on the arXiv abstract page). Here is page 4 of the paper (the two-column Figure 2 area):

*rows 1–2 are Figure 2’s twin captions in opposite columns – Image by author*

Its role: line_df is the unified per-element manifest of the document. Text lines come first, but the same row structure also carries image placeholders and table placeholders: every visible content element on a page is represented as one row, with its own bbox, column_position, and a content_type flag (text, image, table). Text-specific fields (font, render_mode) are NaN for non-text rows; the rich image and table metadata lives in image_df and the table extractor’s output, joined back via (page_num, line_num). The result is that a single sorted query against line_df.page_num returns every element on a page in reading order, regardless of type. Downstream stages don’t need to join three separate tables to know what’s on a given page.

Watch out: on multi-GB or thousand-page PDFs, keeping every line (and image) in memory at once becomes a problem. A lightweight mode that skips line_df and image_df for endpoints that only need parsing_summary (classification, the doc-level summary) keeps those endpoints fast; gate the full parse at ingestion time for everything else.

The screenshot below is from Enterprise Document Intelligence, the desktop app I’m building. The Text panel on the right shows line_df made visible: the page’s native text, line by line, parsed once and read directly from the table, displayed alongside the original page it came from.

*line_df made visible: the page’s native text read straight from the table, beside the original page – Image by author*

1.3. page_df: page granularity

Per-page synthesis. Classification, flags, aggregated metrics.

*per-page synthesis: `page_type`, additive flags, char counts, `n_columns` – Image by author*

How to build it: build_page_df(line_df) groups line_df by page_num. detect_columns_per_page(line_df) computes n_columns and the result is merged in.

What else fits here: build_page_df is the natural home for any per-page signal you can aggregate from line_df in the same pass. Beyond the core triplet, simple aggregations land here for free:

n_lines (page density), native_chars versus ocr_chars (a quick scanned-or-native check, no classifier required), n_fonts and font-size spread (a rough structural signal that distinguishes heading-heavy pages from plain text), image_coverage_ratio (joined with image_df). The columns that need a downstream pass are: page_type, generated by classify_page (covered in the previous article), and parsing_method / context_structured, generated by an adaptive cascade that escalates to a heavier parser when fitz alone isn’t sufficient.
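The single-pass aggregations could be sketched with a pandas groupby. The toy line_df below is made up for illustration, and the column names n_chars, font_size_min, and font_size_max are assumptions rather than the series' actual schema.

```python
import pandas as pd

# Toy stand-in for line_df: one row per text line.
line_df = pd.DataFrame({
    "page_num":  [0, 0, 0, 1, 1],
    "text":      ["Title", "Abstract", "Body text", "1 Introduction", "More body"],
    "font_name": ["A-Bold", "A-Bold", "A-Regular", "A-Bold", "A-Regular"],
    "font_size": [17.2, 12.0, 10.0, 12.0, 10.0],
})

page_df = (
    line_df.assign(n_chars=line_df["text"].str.len())
    .groupby("page_num")
    .agg(
        n_lines=("text", "size"),            # page density
        n_chars=("n_chars", "sum"),          # total characters on the page
        n_fonts=("font_name", "nunique"),    # rough structural signal
        font_size_min=("font_size", "min"),  # font-size spread, lower bound
        font_size_max=("font_size", "max"),  # font-size spread, upper bound
    )
    .reset_index()
)
```

Splitting n_chars into native_chars and ocr_chars would need a per-line flag for the text source, which this toy frame does not carry.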

Applied to the Attention paper:

*simple aggregations alongside the core triplet on the Attention paper – Image by author*

The role: page_df is the anchor point for extraction. Every parser, every OCR pass, every classifier works page by page; page_df is the table that records what each page contains and how it should be processed. The page is also a meaningful semantic unit in its own right: roughly one or two ideas per page in academic papers, one clause per page in contracts, one sub-topic per page in technical reports. Small enough to stay focused, large enough to carry context. That’s why retrieval typically defaults to page-level chunks in a minimal RAG pipeline and why most downstream coordination is keyed off page_num. When you ask “what is page 5 about,” page_df is the row that answers; when you ask “all scanned pages with poor OCR,” page_df is what you filter against.

Watch out: store page_width and page_height per row, never once per document. Letter and A4 sizes get mixed together in technical publishing, and a landscape page is often inserted for a wide table; a single document-level page size causes every bbox-derived metric (column detection, full-page-image coverage) to drift on the odd-sized pages.
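To see the size of the drift, here is a small sketch of an image-coverage computation. The function name and the clipping rule are assumptions; the page sizes are the standard US Letter dimensions in points.

```python
def image_coverage_ratio(image_bbox, page_width, page_height):
    """Share of the page area covered by an image bbox (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = image_bbox
    # Clip to the page so an image bleeding off the edge cannot exceed 1.0.
    width = max(0.0, min(x1, page_width) - max(x0, 0.0))
    height = max(0.0, min(y1, page_height) - max(y0, 0.0))
    return (width * height) / (page_width * page_height)


# A full-page image on a landscape Letter page (792 x 612 pt).
full_page_image = (0, 0, 792, 612)

# Per-row page size: the image is correctly seen as covering the whole page.
per_page = image_coverage_ratio(full_page_image, page_width=792, page_height=612)

# Document-level portrait size (612 x 792 pt) applied to the same page.
doc_level = image_coverage_ratio(full_page_image, page_width=612, page_height=792)
```

With the per-row size the ratio is 1.0; with the document-level size the same image is clipped and scores about 0.77, so a full-page-image check with a high threshold would miss it.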

1.4. image_df: image granularity

One row per embedded image.

How to build it: The parser walks every page and calls page.get_image_info(), which returns each embedded image along with its displayed bounding box and intrinsic dimensions. The Attention paper contains three:

*3 images: page 3 Figure 1, page 4 Figure 2’s two panels – Image by author*

Describing the image content: So far image_df only locates each image: a bounding box, a size, a content hash. It says nothing about what the image depicts, and a bounding box is not retrievable. A chart or a diagram holds no extractable text, so OCR and layout-based parsers leave that part empty: to them the region is invisible. To make the figure searchable we run a vision LLM over each extracted image and store a short description alongside the row, for example “a line chart of commodity prices since 2022” or “the Transformer architecture, an encoder of N stacked layers”. That description is text, so retrieval can match against it. A companion piece on vision-LLM enrichment walks through this step in full detail.

*Each extracted image receives a one-sentence description, which is text that retrieval can match against – Image by author*

1.5. object_registry: cross-reference TARGETS

A cross-reference has two sides. The target is where a named object lives in the document: the line “Figure 1: The Transformer - model architecture” on page 3, the line “Table 1: BLEU scores” on page 8. The source is a body-text mention pointing at the target: “as shown in Figure 1”, “see Table 1”. object_registry captures the target side, one row per caption. The next subsection (section 1.6) captures the source side. Resolving sources to target pages, so that a retrieved chunk mentioning “see Table 1” also pulls in the page where Table 1 lives, is a follow-up cross-reference pass that consumes both tables.

*captions for named objects, one row per target, `(object_type, object_id)` is the join key – Image by author*

How to build it: Detection uses regex patterns ANCHORED at the start of a line (a real caption starts there, a body-text mention does not); build_object_registry walks line_df, matches each line against the patterns, and keeps the first hit for every (object_type, object_id) pair. On the Attention paper:

import re

import pandas as pd

# Anchored at line start (^): a real caption begins the line, a body-text mention does not.
OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_object_registry(line_df: pd.DataFrame) -> pd.DataFrame:
    """Returns one row per (object_type, object_id), first match wins."""

Applied to the Attention paper, the builder lands one row per named object, with the caption line as the anchor:

*5 figures and 4 tables on the Attention paper, each with its caption anchor – Image by author*
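The first-match-wins walk might be sketched like this. It is an illustration, not the body of build_object_registry: it works on plain dicts instead of a DataFrame, the helper name build_registry_rows is hypothetical, and the sample lines are paraphrased for the example.

```python
import re

OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]


def build_registry_rows(lines):
    """lines: dicts with page_num, line_num, text (assumed shape), in reading order."""
    seen = {}
    for line in lines:
        for pattern, object_type in OBJECT_PATTERNS:
            match = pattern.match(line["text"])
            if match is None:
                continue
            key = (object_type, match.group(1))
            # First match wins: a later line with the same label is ignored.
            seen.setdefault(key, {
                "object_type": object_type,
                "object_id": match.group(1),
                "page_num": line["page_num"],
                "line_num": line["line_num"],
                "caption": line["text"].strip(),
            })
            break
    return list(seen.values())


registry = build_registry_rows([
    {"page_num": 3, "line_num": 40, "text": "Figure 1: The Transformer - model architecture."},
    {"page_num": 4, "line_num": 5, "text": "as shown in Figure 2, attention is computed in parallel"},
    {"page_num": 4, "line_num": 12, "text": "Figure 2: (left) Scaled Dot-Product Attention."},
    {"page_num": 9, "line_num": 1, "text": "Figure 1 is repeated here for convenience"},
])
```

The last sample line shows the limit of anchoring: a body sentence that happens to start with "Figure 1" also matches, and only the first-match rule keeps it out of the registry.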

1.6. cross_ref_df: cross-reference SOURCES

The symmetric counterpart of object_registry. Each row is one body-text mention of a named object: “as shown in Figure 2” on page 4, “refer to Table 1” on page 7, “see Annex B for details” on page 12. Every such mention is a source that, when resolved, jumps to a page recorded in object_registry.

As with the TOC, two methods can produce these rows: native PDF links (the deterministic source, when the document carries them) and text-pattern matching on line_df (the general fallback, what build_cross_ref_df ships). Method 1 is exact but partial. Method 2 is approximate but complete.

Method 1: Built-in PDF Links

PDF files can include their own clickable cross-references. The function fitz.Page.get_links() retrieves one entry for each link area, where the destination is represented as a (target_page, to.x, to.y) triplet for internal links, or as a URI for external links:

import fitz
doc = fitz.open("data/nist/NIST.CSWP.29.pdf")
for page in doc:
    for ln in page.get_links():
        tgt_page = ln.get("page")
        tgt_pt   = ln.get("to")        # Point(x, y) on the target page
        print(page.number + 1, ln.get("kind"), tgt_page, tgt_pt, ln.get("uri"))

The key detail here is to.y. Simply knowing the target page reveals where in the document the link leads, but not what specific content it points to; the y-coordinate identifies the exact line on that page. We break the destination into two separate columns — tgt_page and tgt_y — and determine the target line by locating the row in line_df whose y0 value is nearest to tgt_y on tgt_page.

Two practical issues to keep in mind:

PDF creators handle y-axis orientation differently. LaTeX uses a bottom-up coordinate system, while Acrobat uses top-down. The normalizer tests both orientations and retains the one that produces the closer match.
The tgt_y value may fall between two lines. In that case, we round to the closest one.
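Both issues can be handled by one small nearest-line search. The sketch below is an assumption about how such a normalizer could work, not the series' implementation; resolve_landing_line is a hypothetical name, and the y0 values are invented.

```python
def resolve_landing_line(line_y0s, tgt_y, page_height):
    """Return (line_index, distance) for the line whose y0 is nearest to tgt_y.

    Tries the raw value (top-down origin) and the flipped value
    (bottom-up origin), and keeps whichever lands closer to a line.
    Returns None when the page has no lines.
    """
    best = None
    for candidate_y in (tgt_y, page_height - tgt_y):
        for idx, y0 in enumerate(line_y0s):
            distance = abs(y0 - candidate_y)
            if best is None or distance < best[1]:
                best = (idx, distance)
    return best


# y0 of three lines on a 792 pt tall page, measured top-down.
y0s = [72.0, 100.0, 130.0]

bottom_up = resolve_landing_line(y0s, tgt_y=720.0, page_height=792.0)
top_down = resolve_landing_line(y0s, tgt_y=101.0, page_height=792.0)
```

Keeping the distance in the result is deliberate: a large distance suggests the link does not point at any text line, and the wrong orientation can occasionally land closer by coincidence, so a caller may want a threshold.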

The benefit is clear: once the landing line is identified, we can join (target_page, landing_text) with toc_df to directly retrieve the section index. There is no need for regex or text matching against breadcrumb trails. The built-in link pinpoints exactly which toc_idx was reached.

*Four NIST TOC links pointing at section headers, joinable to `toc_df` – Image by author*

Applying the same pipeline to the Attention paper reveals a different type of link: citations that point to bibliography entries rather than TOC section headers.

*Three Attention-paper citations resolved to bibliography lines via `landing_text` – Image by author*

Coverage is the limiting factor. Both demo PDFs exhibit the same behavior:

Attention paper: 95 internal links, all of which are citations pointing to bibliography entries, along with 18 external URIs (GitHub, arXiv). There are zero built-in links for body-text references such as “as shown in Figure 2”.
NIST Cybersecurity Framework 2.0 (CSWP-29; US Government work, public domain in the US — see NIST copyright statement): 47 internal links, covering all TOC entries and the list of figures that point to section headers, plus 56 external URIs. The situation is the same: no body-text references to figures or tables are hyperlinked.

Enterprise documents tend to be even less cooperative, often containing no built-in links at all (scanned pages, screenshots, or exports from tools that strip link metadata). So while built-in links are a highly reliable signal when available — deterministic and resolvable to a toc_idx when the target is a section header — they never account for the full range of cross-references present in a document.

Method 2: Text-Pattern Matching

Detection relies on the same vocabulary used in OBJECT_PATTERNS, but the regex is applied without anchors so it can match anywhere within a line. Caption lines are excluded to ensure that the line that defines Figure 2 is not also counted as a reference to it.

*One row per body-text mention, joinable back to `object_registry` – Image by author*

On the Attention paper:

import re

import pandas as pd

# Unanchored (\b instead of ^): a mention can sit anywhere within a line.
REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_cross_ref_df(line_df: pd.DataFrame) -> pd.DataFrame:
    """One row per body-text mention, with roughly 30 characters of surrounding context."""

When run on the Attention paper, every body-text reference to a figure or table is captured as a row that can be joined back to object_registry:

*Six of thirteen mentions; Figure 2 is referenced three times across pages 4 and 5 – Image by author*
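A minimal sketch of the source-side scan follows. It assumes caption lines are excluded with a start-of-line check (the real builder may instead consult object_registry), works on plain dicts, and uses the hypothetical names CAPTION_START and build_cross_ref_rows.

```python
import re

REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

# Assumption: a line that starts with an object label is a caption, not a mention.
CAPTION_START = re.compile(
    r"^\s*(?:Figure|Fig\.?|Table|Annex|Appendix)\s+[A-Z0-9]+\b", re.IGNORECASE
)


def build_cross_ref_rows(lines, context_chars=30):
    rows = []
    for line in lines:
        text = line["text"]
        if CAPTION_START.match(text):
            continue  # the caption defines the object; it is not a reference to it
        for pattern, ref_type in REFERENCE_PATTERNS:
            for match in pattern.finditer(text):
                start = max(0, match.start() - context_chars)
                end = min(len(text), match.end() + context_chars)
                rows.append({
                    "page_num": line["page_num"],
                    "line_num": line["line_num"],
                    "ref_type": ref_type,
                    "ref_id": match.group(1),
                    "context": text[start:end],
                })
    return rows


mentions = build_cross_ref_rows([
    {"page_num": 4, "line_num": 1, "text": "Figure 2: (left) Scaled Dot-Product Attention."},
    {"page_num": 4, "line_num": 8, "text": "as depicted in Figure 2, with results in Table 2."},
    {"page_num": 5, "line_num": 3, "text": "See Fig. 2 for details."},
])
```

Using finditer rather than search matters here: one line can mention several objects, and each mention gets its own row.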

Across the demo PDFs, the Attention paper contains 13 body-text mentions covering 6 unique objects (Figure 1, Figure 2, Tables 1–4). Some figures are referenced more than once, which is precisely the kind of information the source-side table is designed to capture.

NIST CSF 2.0 has 13 mentions (7 figure references, 5 annex references, 1 table reference) covering 10 unique objects (5 figures, 4 annexes, 1 table). The discrepancy with NIST’s object_registry (6 figures + 3 annexes + 2 tables) is meaningful:

One annex is referenced in the body text but lacks an anchored caption in the document (the regex picks up a reference whose target exists outside the parsed content).
One registered figure and one registered table are never referenced at all.

Both scenarios represent real-world signals that are valuable to surface for any downstream cross-reference resolution process.
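Surfacing both cases takes two set differences over the join keys. This is a sketch with invented keys, not the NIST counts; cross_ref_gaps is a hypothetical helper name.

```python
def cross_ref_gaps(registry_keys, reference_keys):
    """Compare (type, id) keys from object_registry and cross_ref_df."""
    registry = set(registry_keys)
    references = set(reference_keys)
    return {
        # Mentioned in body text, but no anchored caption was found.
        "dangling_references": sorted(references - registry),
        # Captioned in the document, but never mentioned in body text.
        "unreferenced_objects": sorted(registry - references),
    }


gaps = cross_ref_gaps(
    registry_keys=[("figure", "1"), ("figure", "2"), ("table", "1"), ("annex", "A")],
    reference_keys=[("figure", "1"), ("figure", "1"), ("annex", "A"), ("annex", "B")],
)
```

Repeated mentions collapse in the set, which is fine for this check: the question is whether a key exists on the other side, not how often.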

1.7. span_df: Sub-Line Granularity (Optional)

Sometimes a full line is too coarse a unit. A single line might combine bold and regular text — for example, a defined term in a contract. A line in a research paper might mix an inline equation in italic with surrounding prose. A line in an amendment might show the original text in black alongside the modification in red.

class Span(BaseModel):
    # Identity & ordering
    pdf_hash: str
    page_num: int
    line_num: int
    span_id:  int

    # Content and position
    text: str
    bbox: tuple[float, float, float, float]

    # Typography signals
    font_name: str
    font_size: float
    is_bold:   bool
    is_italic: bool
    color_rgb: tuple[int, int, int]

A span_df offers finer granularity than line_df. On the Attention paper, the count comes to 3,480 spans compared to 1,048 lines — roughly 3.3 times more records. This overhead is only justified for stages that need to examine typographic details:

Heading detection: Headings are usually bold and often set in a larger size or a separate font, so detecting bold spans is a reliable way to spot them.
TOC reconstruction: A good Table of Contents can be rebuilt just by finding bold spans. If the PDF doesn’t have built-in bookmarks, this TOC reconstruction step is used to map out the document’s structure.
List item detection: Often, a bold text snippet at the start of a paragraph signals a list item or enumeration.
Defined terms in legal docs: Bold or italic words in contracts are frequently formally defined somewhere in the document. Identifying them during parsing lets you build glossary links later.
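As one example of a span-level consumer, a heading detector could flag lines in which every visible span is bold. The sketch uses a reduced stand-in for the Span model (SpanRow, a hypothetical dataclass with only the fields it needs) and invented sample spans; a real detector would likely combine this with font size.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanRow:
    page_num: int
    line_num: int
    text: str
    is_bold: bool


def heading_candidates(spans):
    """Return (page_num, line_num, text) for lines whose visible spans are all bold."""
    by_line = defaultdict(list)
    for span in spans:
        by_line[(span.page_num, span.line_num)].append(span)

    candidates = []
    for (page_num, line_num) in sorted(by_line):
        line_spans = by_line[(page_num, line_num)]
        visible = [s for s in line_spans if s.text.strip()]
        # A line with one bold word inside regular prose is not a heading.
        if visible and all(s.is_bold for s in visible):
            text = "".join(s.text for s in line_spans).strip()
            candidates.append((page_num, line_num, text))
    return candidates


headings = heading_candidates([
    SpanRow(2, 0, "3.5 ", True),
    SpanRow(2, 0, "Positional Encoding", True),
    SpanRow(2, 1, "Since our model contains no ", False),
    SpanRow(2, 1, "recurrence", True),
])
```

The second line is the reason this needs span_df at all: line_df only stores a dominant style per line, so mixed-weight lines cannot be told apart there.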

How to build it: The default behavior is that parse_pdf(...) returns an empty span_df. If a downstream part of your pipeline needs it, you call a separate builder function for it:


paper = parse_pdf(paper_pdf)
paper["span_df"] = build_span_df(paper_pdf)   # 3,480 rows on the Attention paper

Making the span data an explicit call keeps it optional, so you don’t pay the cost on every parse for steps that only need line_df. Running it on the Attention paper:

*rows 1–5 are body text; rows 6–7 show a section header in bold. The `is_bold` flag is what the TOC reconstructor relies on – Image by author*

1.8. parsing_summary: technical synthesis

A single JSON dictionary per document. At a glance, it answers: “Is this PDF scanned?”, “Does it need OCR?”, “What extraction strategy should the next step use?” It also adds a layer of meaning the downstream stages read: “What type of document is this, and what is it about?”

The dictionary is organized into five sections. The first four are deterministic, built by the parser without calling any AI model. The fifth, the semantic section, contains the document type plus a short AI-generated summary that gets plugged into the question parser’s system prompt.

{
  "pdf_hash": "abc123...",
  "n_pages": 87,
  "pdf_version": "1.7",
  "source_software": "word_export",
  "creator_raw": "Microsoft Word 2019",
  "producer_raw": "Microsoft Word for Microsoft 365",
  "content_type": "scanned_with_ocr",
  "is_scanned": true,
  "has_text_layer": true,
  "ocr_quality": "good",
  "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
  "scanned_page_ratio": 0.92,
  "has_toc": true,
  "n_toc_entries": 24,
  "n_named_objects": 11,
  "is_encrypted": false,
  "has_form_fields": false,
  "recommended_strategy": "use_existing_ocr",
  "needs_reocr": false,
  "pages_needing_ocr": [],
  "doc_type": "annual_report",
  "typical_fields": ["fiscal_year", "revenue", "net_income", "auditor"],
  "summary": "87-page annual report for fiscal year 2023. Covers revenue, net income, and auditor's notes across operating segments. Standard sections: Letter to Shareholders, MD&A, Financial Statements, Notes."
}

The difference between source_software (read from metadata) and content_type (inferred from actual content) is important. They can disagree: a PDF whose Producer says “Microsoft Word” but whose pages are entirely scanned images means someone likely embedded images into a Word file and exported. That mismatch is useful information—don’t replace one with the other.
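A few of these fields can be cross-checked against each other. The sketch below is an assumption about what such a check might look like: the function name, the "scanned" prefix rule, and the NATIVE_AUTHORING_SOURCES set are all hypothetical, while the input values are taken from the example dictionary above.

```python
# Hypothetical label set: source_software values that normally imply native text.
NATIVE_AUTHORING_SOURCES = {"word_export"}


def summary_checks(summary):
    counts = summary["page_type_counts"]
    # Assumption: every scanned page type is named with a "scanned" prefix.
    scanned = sum(n for page_type, n in counts.items() if page_type.startswith("scanned"))
    return {
        "page_counts_add_up": sum(counts.values()) == summary["n_pages"],
        "scanned_page_ratio": round(scanned / summary["n_pages"], 2),
        # Metadata says "authored natively", content says "scanned": keep both, flag it.
        "metadata_content_mismatch": (
            summary["source_software"] in NATIVE_AUTHORING_SOURCES
            and summary["is_scanned"]
        ),
    }


checks = summary_checks({
    "n_pages": 87,
    "source_software": "word_export",
    "is_scanned": True,
    "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
})
```

On the example, 80 scanned pages out of 87 gives the 0.92 ratio stored in the dictionary, and the mismatch flag is raised without overwriting either field.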

The semantic section follows a similar idea on a different level. doc_type is a broad category—resume, contract, academic_paper, invoice, memo, annual_report, and so on—guessed from the filename and the text on the first page. No AI needed for this. typical_fields is a table of field names tied to each document type—the things a question about that document is most likely to ask about. A resume gets [name, email, phone, experience, ...], a contract gets [policyholder, premium, deductible, ...]. The summary field is the only one produced by an AI model: three or four plain sentences naming the document type, its main topic, and the key fields it contains. One AI call at parse time, cached for good, and injected into the question parser’s system prompt so that “what is the name?” on a CV doesn’t just come back as not found. The companion article on what to read before any line gets a number (“Beyond extract_text”) covers the full design of that summary.

2. The relational model: how the tables link

Creating the tables is one challenge; connecting them is another. Once the tables are in place, the common keys shared across them turn eight separate DataFrames into a single queryable model, and almost every connection flows back to line_df, the per-line source of truth.

*How the tables join together: line_df sits at the center, with each table linked through its shared key – Image by author*

A handful of links carry most of the weight:

toc_df → line_df. A Table of Contents entry stores its start_page (and start_y), so from any section you can jump directly to its lines. A request like “Summarize section 3.5” becomes a simple page-range filter on line_df, with no search needed.
image_df ↔︎ line_df. An image has a position on the page, which corresponds to a line slot in line_df. That line’s text is initially blank, since an image holds no extractable text. Optionally, a vision pass analyzes the image and writes a brief description into that text cell, allowing retrieval to match phrases like “the architecture diagram” later. This link makes that enrichment incremental—fill it in when it’s needed, leave it empty otherwise.
cross_ref_df → its target. A mention in the body text resolves to wherever the target lives. “see Figure 2” resolves to object_registry by matching (ref_type, ref_id) against (object_type, object_id); “see section 2.3” resolves to a toc_df entry. The table is populated as references are matched, so resolution works on demand, one mention at a time.
page_df, span_df, object_registry all link back to line_df through page_num or (page_num, line_num), the same join every downstream component relies on.

In practice, common questions collapse into one or two simple filters:

“Summarize section 3.5.” Find its start_page and end_page in toc_df, then line_df[line_df.page_num.between(start, end)]. No embeddings, no keyword search—just the lines in that section.
“What are the totals?” on the invoice from section 3.2 → line_df[line_df.column_position == "right"]. The column the parser identified becomes your query.
“What does Figure 2 show?” object_registry resolves the caption to its page and line and returns the caption text; if a vision pass has populated the image’s slot, you also receive the description.
“Where is Table 1 referenced?” cross_ref_df[(cross_ref_df.ref_type == "table") & (cross_ref_df.ref_id == 1)] enumerates every mention along with its (page_num, line_num), joined back to toc_df to identify the section name where each mention resides.

Each operation is simply a filter or a join performed on tables already loaded in memory — never a re-parse.
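Two of these questions, written out against toy tables, might look like the sketch below. The rows are invented for illustration and are not parsed values; section_for_page is a hypothetical helper, and picking the deepest covering section is an assumption about how to break ties on shared pages.

```python
import pandas as pd

toc_df = pd.DataFrame({
    "title": ["3 Model Architecture", "3.5 Positional Encoding", "4 Why Self-Attention"],
    "level": [1, 2, 1],
    "start_page": [2, 6, 6],
    "end_page": [6, 6, 7],
})
line_df = pd.DataFrame({
    "page_num": [5, 6, 6, 7],
    "line_num": [0, 0, 1, 0],
    "text": ["...", "3.5 Positional Encoding", "Since our model ...", "... see Table 1 ..."],
})
cross_ref_df = pd.DataFrame({
    "page_num": [6, 7],
    "line_num": [1, 0],
    "ref_type": ["table", "table"],
    "ref_id": [1, 1],
})

# "Summarize section 3.5": look up the page range, then filter line_df.
section = toc_df[toc_df["title"].str.startswith("3.5")].iloc[0]
section_lines = line_df[line_df["page_num"].between(section["start_page"], section["end_page"])]


def section_for_page(page_num):
    """Title of the deepest toc_df entry whose page range covers page_num."""
    covering = toc_df[(toc_df["start_page"] <= page_num) & (toc_df["end_page"] >= page_num)]
    if covering.empty:
        return None
    return covering.sort_values("level").iloc[-1]["title"]


# "Where is Table 1 referenced?": filter the mentions, then attach the section name.
table_1 = cross_ref_df[(cross_ref_df["ref_type"] == "table") & (cross_ref_df["ref_id"] == 1)]
table_1 = table_1.assign(section=table_1["page_num"].map(section_for_page))
```

Because of the one-page overlap convention, page 6 is covered by three sections in this toy TOC. A page-level join cannot separate them, which is where start_y and line-level anchors would come in.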

This is the real payoff of those joins downstream. Retrieval fetches a section from toc_df, expands it into its constituent lines via line_df, and broadens the context to the figures it references through object_registry; the generation stage reads those lines; highlighting maps citations back onto the page using (page_num, line_num). The entire pipeline becomes an inexpensive chain of joins built on a single parse, rather than re-reading the PDF at every step. How these joins translate into concrete SQL primary keys, foreign keys, and indexes is a concern for the storage layer, which lies beyond the scope of this article.

3. parse_pdf on two real PDFs, side by side

parse_pdf serves as the single entry point that invokes every helper described above and returns the complete set of linked tables in one call. When you run it on two very different PDFs, the output structure remains identical: same keys, comparable shapes.

3.1. parse_pdf side-by-side on two real PDFs

Executing both calls and placing the two returned dicts side by side reveals that the keys hold up consistently, with per-cell tallies that mirror each document’s unique characteristics:

*same keys, same shape across documents, with per-cell tallies – Image by author*

A LaTeX research paper and the NIST Cybersecurity Framework 2.0 (CSWP-29, US government work, public domain). Two markedly different documents: one spans 15 pages of mathematical notation in a NeurIPS-style two-column layout, the other contains 32 pages of policy text blending single and two-column sections. Same parse_pdf invocation, same keys, every column directly comparable. The Attention paper delivers an unexpected insight along the way: this arXiv version carries 22 native TOC entries, contradicting the widespread assumption that arXiv strips bookmarks.

The PDF is opened once with fitz, every helper operates on the same document state, and the file is closed before the function returns. No reopening, no re-downloading from S3, no risk of two helpers seeing different page versions. From this point forward, retrieval, generation, and annotation never touch the PDF again. They query the dict.

3.2. column_position in action (an invoice)

Invoices are the textbook use case for column_position: line items extend down the left column (descriptions), while prices and totals are stacked down the right column. We use a one-page fictional invoice (data/invoices/invoice_01.pdf, openly licensed, generated for this series) so the layout is a genuine two-column billing format rather than a research paper’s figure caption.

*each line boxed by the column the parser assigned: blue = left (descriptions), green = right (amounts and totals) – Image by author*

Examine the source page first. Each line is enclosed in a box colored according to the column the parser assigned it: blue for the left side (descriptions), green for the right side (amounts and totals). assign_column_positions detects that split cleanly:

*header line on the left at x0 ≈ 54, totals stack on the right at x0 ≈ 391-514, a line item splits a description on the left and quantity + price on the right at the same y0 – Image by author*

The header line sits in the left column at x0 = 54. Below the items table, the totals are stacked on the right: “TOTAL DUE:” at x0 ≈ 391, the amount $2,027.56 at x0 ≈ 497. The line item at y0 = 397.13 illustrates the split clearly: the description “Staff training” sits at x0 = 54 (left), the quantity 0.5 and unit price $197.58 sit at x0 ≈ 343 and x0 ≈ 395 (right). Downstream, pulling everything the parser placed on the right (quantities, prices, and totals) becomes a single-line query against line_df: line_df[line_df["column_position"] == "right"]. Isolating the totals block is then one more condition on y0.

No vision pass, no bounding-box arithmetic. Just a column filter on a structured table.
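To make the idea concrete, here is a deliberately simplified stand-in for assign_column_positions. It is not the author's implementation: it assumes a 612 pt wide page (US Letter) and labels a line by which side of the page midpoint its x0 falls on. The x0 values are the ones quoted above; the y0 of the totals row is a placeholder, since the article does not state it.

```python
import pandas as pd

PAGE_WIDTH_PT = 612.0  # assumption: US Letter width in PDF points


def assign_columns_by_midpoint(line_df, page_width=PAGE_WIDTH_PT):
    """Label each line 'left' or 'right' by comparing x0 to the page midpoint."""
    labeled = line_df.copy()
    midpoint = page_width / 2
    labeled["column_position"] = [
        "left" if x0 < midpoint else "right" for x0 in labeled["x0"]
    ]
    return labeled


line_df = pd.DataFrame({
    "text": ["Staff training", "0.5", "$197.58", "TOTAL DUE:", "$2,027.56"],
    "x0":   [54.0, 343.0, 395.0, 391.0, 497.0],
    # 397.13 comes from the article; 520.0 is a made-up placeholder.
    "y0":   [397.13, 397.13, 397.13, 520.0, 520.0],
})

labeled = assign_columns_by_midpoint(line_df)
right_side = labeled[labeled["column_position"] == "right"]
totals_block = right_side[right_side["y0"] > 397.13]
print(right_side[["text", "x0", "y0"]])
print(totals_block["text"].tolist())
```

A fixed midpoint is the crudest possible rule and would mislabel pages with uneven columns. A real detector would more likely look for a gap in the distribution of x0 values on each page.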

3.3. Two PDFs, same parser, same shape

Two very different documents, the same parser, directly comparable structured outputs:

*same parser, two PDFs; lines, columns, TOC entries, named objects all queryable – Image by author*

What this would have looked like with a naive get_text() parser: a single string per document, no way to distinguish OCR’d text from native text, no knowledge of where each figure caption is located, no separation between the left and right halves of a two-column page. The retrieval and generation stages would have been built on an unstable foundation.

4. Save once, reload forever

Parsing is the most expensive component in the pipeline. Question parsing, retrieval, and generation each require one LLM call; parsing reads bytes and resolves layout. With PyMuPDF it remains fast (sub-second on a small paper). With heavier engines (Azure Layout, Tesseract, vision-LLM fallback), the same PDF can take 30 seconds to several minutes per run. Three iterations on a downstream prompt means three separate OCR runs. There is no reason for that.

The solution is path-driven. Each PDF writes its parsed tables to a mirror folder under the output directory, matching the source path exactly. From the PDF path alone, every downstream step (retrieval, generation, annotation) knows where the cache is located.

*every PDF in `data/` has a twin folder in `output/` carrying its parsed tables – Image by author*
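The path mirroring can be expressed in a few lines of pathlib. This is a sketch with a hypothetical helper name, cache_dir_for, and it assumes the twin folder is named after the PDF without its extension; the series' actual layout may differ in detail.

```python
from pathlib import Path


def cache_dir_for(pdf_path, data_root="data", output_root="output"):
    """Map data/<sub>/<name>.pdf to output/<sub>/<name> (assumed layout)."""
    # relative_to raises ValueError if the PDF is not under data_root.
    relative = Path(pdf_path).relative_to(data_root)
    return Path(output_root) / relative.with_suffix("")


invoice_cache = cache_dir_for("data/invoices/invoice_01.pdf")
print(invoice_cache.as_posix())  # output/invoices/invoice_01
```

Because the mapping is a pure function of the PDF path, no lookup table or database is needed to find a document's cache.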

The relational tables are saved to .xlsx (one file per table, opens with a double-click), and parsing_summary is stored as JSON. Excel is sufficient at this stage: pandas round-trips cleanly, and each table remains inspectable in any spreadsheet application. A production storage layer would swap in SQLite (foreign keys, joins across documents, append-on-update). However, the downstream components work with DataFrames regardless.

The save_parsed function stores the data in a folder, while load_parsed retrieves the same dictionary—or returns None if no cached version exists. The typical usage boils down to just a few lines:


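# Reuse the cached tables when they exist; parse only on a cache miss.
parsed = load_parsed(pdf_path)
if parsed is None:
    parsed = parse_pdf(pdf_path)
    save_parsed(pdf_path, parsed)  # the next run loads from disk instead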

The same caching logic applies throughout the pipeline. Question parsing saves its ParsedQuestion object to questions//parsed_question.json, retrieval outputs retrieved_pages.xlsx, and generation stores answer.json. Each stage can be restored from disk and re-executed without re-invoking the LLM. For example, if you adjust a generation prompt, parsing and retrieval don’t need to run again—saving both time and cost.
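For readers who want to see the shape of such a save/load pair, here is a minimal sketch. It is not the article's save_parsed and load_parsed. The functions are renamed with a _sketch suffix, they take the cache folder directly rather than the PDF path, and they write CSV instead of .xlsx so the example needs no Excel writer. The "None on a cache miss" contract is the part taken from the article.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_parsed_sketch(cache_dir, parsed):
    """Write each DataFrame as CSV and everything else as JSON."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, value in parsed.items():
        if isinstance(value, pd.DataFrame):
            value.to_csv(cache_dir / f"{name}.csv", index=False)
        else:
            (cache_dir / f"{name}.json").write_text(json.dumps(value))


def load_parsed_sketch(cache_dir):
    """Rebuild the dict from disk, or return None if nothing is cached."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    parsed = {}
    for path in sorted(cache_dir.iterdir()):
        if path.suffix == ".csv":
            parsed[path.stem] = pd.read_csv(path)
        elif path.suffix == ".json":
            parsed[path.stem] = json.loads(path.read_text())
    return parsed


# Toy parse result standing in for parse_pdf output.
parsed = {
    "line_df": pd.DataFrame({"page_num": [1, 1], "line_num": [1, 2],
                             "text": ["Invoice", "TOTAL DUE:"]}),
    "parsing_summary": {"n_pages": 1, "render_mode": "native"},
}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "invoices" / "invoice_01"
    missing = load_parsed_sketch(cache)   # nothing cached yet
    save_parsed_sketch(cache, parsed)
    restored = load_parsed_sketch(cache)  # same tables, read from disk

print(missing, sorted(restored))
```

CSV loses dtype information that .xlsx or SQLite would preserve more faithfully (dates, for instance), so treat the format here as a convenience for the example only.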

5. Conclusion

An effective RAG parser does more than pull text from a document: it transforms an unstructured PDF into a structured, relational representation. This means creating interconnected tables linked by shared keys like page_num, line_num, and (ref_type, ref_id), where each table holds a distinct type of information. Once this structure is built, downstream tasks like retrieval, answer generation, and annotation never go back to the original PDF; they operate entirely on DataFrames. By parsing once and reusing that result indefinitely, you convert a per-question delay (30 seconds or more with a heavier parsing engine) into a one-time cost per document.

The output isn’t a flat string—it’s a set of relational tables. Every tool connected to the parser—whether for keyword search, semantic similarity, section lookup, citation formatting, audit logging, or change tracking—reads from these tables instead of raw PDF bytes. The PDF itself is opened only during ingestion. After that, all operations use SQL or pandas. This design justifies the upfront engineering effort: you pay the parsing cost once per file, and every subsequent pipeline iteration works against a consistent, queryable data layer.

This post is part of the Enterprise Document Intelligence series. The minimal RAG pipeline demonstrates how these relational tables function end-to-end on a real-world PDF.

6. Sources and further reading

Earlier in the series:

The parser described here mirrors the architecture of Docling (Auer et al., Docling Technical Report, IBM Research 2024): it uses layout detection, TableFormer for tables, and reading-order modeling. Borderless table extraction relies on the approach from Smock et al. (PubTables-1M / Table Transformer, CVPR 2022). The page classification system builds on foundations from Pfitzmann et al. (DocLayNet, KDD 2022). On top of that, this parser introduces a render-mode detection step (identifying native, scanned, or mixed documents) along with OCR quality scoring. Its output is a relational table set—including line_df, page_df, image_df, toc_df, object_registry, cross_ref_df, span_df, and a parsing_summary dictionary. Downstream stages like retrieval, generation, and annotation never re-access the PDF; they query these DataFrames directly.

Aligned with this article’s approach:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Describes the reference pipeline used here: layout analysis, TableFormer, reading order, and unified document representation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). Vision-based method for detecting and structuring tables—core to most modern table parsers.
Pfitzmann et al., DocLayNet, KDD 2022 (arXiv:2206.01062). Provides empirical benchmarks for page categories and layout detection used in this work.
Lo et al., PaperMage, EMNLP 2023 demos. Illustrates the distinction between parsing for indexing versus parsing for reading—highlighting that retrieval-focused parsing differs from generation-focused parsing.

Alternative perspectives:

Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Uses page images directly for retrieval via vision-language models, skipping the table-parsing step entirely. In contrast, this article relies on bounding-box-anchored DataFrames as its foundational data structure.
Wang et al., DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding, JPMorgan 2024 (arXiv:2401.00908). Employs an LLM that processes PDFs natively without a dedicated relational parsing component. Shares conceptual ground with ColPali but differs from this article’s emphasis on a queryable relational artifact.
Kim et al., OCR-free Document Understanding Transformer (Donut), ECCV 2022 (arXiv:2111.15664). Presents an end-to-end model that bypasses OCR altogether—a useful point of comparison with the OCR-quality assessment layer added in this article’s render-mode detection phase.
