**Ghost TOC, Found: Reconstructing A Missing PDF's Structure For Precision RAG Retrieval**

This article continues the document parsing thread of Enterprise Document Intelligence, a series on building an enterprise RAG system from four foundational components. It expands on Article 5 (document parsing) by examining a single table: toc_df, which captures the document’s organizational structure. Article 5 populates this table from the PDF’s built-in outline (via PyMuPDF’s doc.get_toc()) when one is available. This installment addresses the case where no built-in outline exists and shows how to reconstruct that structure from the visible content on the page.

*Where this companion piece sits: it builds on Article 5 (document parsing), within Part II (the four components), and recreates the table of contents when the PDF includes none. Image by author*

Take NIST FIPS 202, the SHA-3 standard (a US Government publication in the public domain; refer to the NIST copyright notice), and go to page seven. You’ll find a well-formatted table of contents with section titles aligned to the left and corresponding page numbers on the right. Now open the same file in any PDF viewer and check the bookmarks panel. It’s blank. The contents page exists only as ink on the page, not as structured data the software can interpret. The document author created a proper table of contents, but the file doesn’t expose it in a machine-readable way.

Article 5 (document parsing) and Article 5B (the relational data model) relied on doc.get_toc(), the PDF’s built-in outline, to populate toc_df. When present, it’s precise. But it frequently isn’t available. Many real-world documents (academic papers exported directly from LaTeX, contracts converted to PDF, government standards) include a printed table of contents yet lack an internal outline. In those instances, toc_df comes back empty, even though the document clearly displays its structure right there on page seven.

That structural information isn’t merely decorative. Retrieval operates on a per-section basis (Article 7). The chunker splits along heading boundaries (Article 5B). Summarization processes the document section by section. Each of these stages depends on toc_df. When the table is empty, retrieval reverts to scanning every page, the chunker divides content based on arbitrary page breaks, and the resulting answers lose the document’s inherent organization. So the question addressed here is both focused and practical: when a file contains no outline but does include a printed contents page, how do you convert that page back into a functional toc_df?

One clarification right away, since the distinction is easily blurred. This discussion applies to documents that already have a contents page. A document with no contents page whatsoever (a paper that begins directly with “1. Introduction”, a brief five-page memo, or an export that stripped all headings entirely) presents a fundamentally different challenge. Recovering structure from within the body of an unstructured document is a summarization task, a separate objective that constructs the map from individual chunks instead of reading it directly from a page. Our focus here is strictly on interpreting an existing contents page.

1. Two phases: extract the entries, then locate their actual pages

It’s useful to distinguish between two distinct pieces of information that a contents page provides. First, it gives you an ordered list of sections with titles and their hierarchical relationships: what the document covers and in what sequence. Second, it provides a mapping from each section to its physical starting location within the file. The built-in outline delivers both automatically. Reading a printed contents page, however, gives you the first directly, while the second arrives only as printed references, which aren’t physical page numbers. These two phases have different potential points of failure, so this article treats them separately: first extract the entries, then align them to their physical locations.

Input: a PDF where doc.get_toc() returns nothing but a contents page is printed within. Output: a toc_df matching the schema defined in Article 5B (level, title, start_page, end_page, breadcrumb), ensuring all downstream processes continue functioning as expected.
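The readers below only produce level, title, and start_page; the remaining two columns can be derived from those. The following is a minimal sketch under stated assumptions, not the Article 5B implementation: finalize_entries is a hypothetical helper, it works on plain dicts rather than a DataFrame, and it assumes a section ends on the page where the next section of the same or a higher level begins (headings rarely start at the top of a page).

```python
def finalize_entries(entries, last_page):
    """Fill end_page and breadcrumb for entries given in reading order.

    Assumption (not from the source schema definition): a section ends on
    the page where the next section of the same or a higher level starts;
    trailing sections run to last_page.
    """
    rows, trail = [], []              # trail holds the titles of open ancestors
    for i, entry in enumerate(entries):
        # Keep only the ancestors above this entry's level, then add the entry.
        trail = trail[: entry["level"] - 1] + [entry["title"]]
        end_page = last_page
        for later in entries[i + 1:]:
            if later["level"] <= entry["level"]:
                end_page = max(entry["start_page"], later["start_page"])
                break
        rows.append({**entry,
                     "end_page": end_page,
                     "breadcrumb": " > ".join(trail)})
    return rows


entries = [
    {"level": 1, "title": "Introduction", "start_page": 9},
    {"level": 2, "title": "Scope", "start_page": 9},
    {"level": 1, "title": "Glossary", "start_page": 11},
]
rows = finalize_entries(entries, last_page=37)
```

Passing rows to a DataFrame constructor would then give a table with the five columns listed above.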

Contents pages come in two varieties, each requiring different effort to interpret. Together with the native outline, that makes three scenarios.

2. Three scenarios, ranked by complexity

*The cascade attempts each scenario sequentially and stops at the first one that produces a usable table of contents. Image by author.*

Each scenario involves both a detection phase and an extraction phase, falling through to the next when detection fails or yields insufficient results.

Scenario 1: native outline. Already handled by the build_toc_df function from Article 5. Free, precise, and hierarchically complete. When this succeeds, no additional work is needed. We mention it here only to establish the cost baseline.
Scenario 2: contents page with hyperlinks. No outline exists, but an early page lists section titles as internal hyperlinks pointing to locations within the document. Because each link already specifies its physical target page, alignment becomes unnecessary.
Scenario 3: contents page without hyperlinks. A page formatted like a traditional printed table of contents (titles, dot leaders, right-aligned page numbers) but containing no links. The page numbers shown are the document’s own logical numbering rather than physical page indices, so this scenario requires the alignment phase.

All of this logic resides in a dedicated module, kept separate from the native-outline path to maintain the clarity of Article 5. The entry point is reconstruct_toc_df.
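The cascade itself is small. The sketch below illustrates the control flow only; run_cascade, its (name, reader) pairs, and the min_entries threshold are assumptions made for this example and are not the actual signature of reconstruct_toc_df.

```python
def run_cascade(readers, min_entries=3):
    """Try each (name, reader) pair in order; return the first usable result.

    readers: list of (name, zero-argument callable returning a list of entries).
    min_entries: assumed threshold for what counts as a usable table of contents.
    """
    for name, reader in readers:
        entries = reader()
        if entries and len(entries) >= min_entries:
            return name, entries      # stop at the first scenario that works
    return None, []                   # nothing usable: the caller decides what next


# Toy readers standing in for the three scenarios.
name, entries = run_cascade([
    ("outline", lambda: []),                                  # no native outline
    ("links", lambda: [{"title": "a"}, {"title": "b"}]),      # too few links
    ("printed", lambda: [{"title": t} for t in "abc"]),       # usable
])
```

Returning the scenario name alongside the entries keeps the result auditable: downstream code can record which path produced the table.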

3. Follow the hyperlinks

Scenario 2 is the favorable case. Some documents lack an outline but do include a clickable contents page. The NIST Cybersecurity Framework is one example: page two presents every section as a hyperlink navigating into the document. PyMuPDF exposes these links on a per-page basis, and each internal link carries its target page directly.

Input: the PDF (links are not captured in line_df, so this reader opens the file directly). Output: entries with a title and the physical target page, already resolved.

Detection relies on a density check: a page containing five or more internal links is treated as a navigation page rather than a body page with an occasional footnote link. The extraction process connects each link’s clickable region back to the text it covers, then removes the decorative leaders and the trailing page label.

import fitz   # PyMuPDF

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is identified as the page with the highest number of internal links."""
    best = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            entries = []
            for link in page.get_links():
                if link["kind"] != fitz.LINK_GOTO:        # only internal navigation links
                    continue
                # text_under_rect and clean are helpers not shown here
                label = clean(text_under_rect(page, link["from"]))
                if label:
                    entries.append({"title": label,
                                    "start_page": link["page"] + 1,  # 0-based target -> 1-based physical page
                                    "level": 1})
            # keep the densest page, provided it clears the min_links threshold
            if len(entries) >= min_links and len(entries) > len(best):
                best = entries
    return best
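The clean helper is not shown in the article. One plausible version, offered as a sketch and not as the author’s implementation, removes a leader or wide gap followed by a page label, then any leftover leader run. The TAIL pattern and the choice to strip only labels that follow a leader or at least two spaces are assumptions.

```python
import re

# A dot leader (or a gap of 2+ spaces) followed by a trailing page label.
TAIL = re.compile(r"(?:\s*[.…·]{2,}[\s.…·]*|\s{2,})\d{1,4}\s*$")
LEFTOVER_LEADER = re.compile(r"[.…·]{3,}")

def clean(raw):
    """Reduce the text under a link rectangle to the bare section title."""
    text = raw.replace("\n", " ").strip()
    text = TAIL.sub("", text)                 # drop leader + page label
    text = LEFTOVER_LEADER.sub(" ", text)     # drop any remaining leader run
    return " ".join(text.split())             # collapse whitespace
```

The conservative rule has a cost: a label separated from its title by a single space ("Introduction 12") is left in place, because stripping it would also damage titles that legitimately end in a number, such as "Version 2".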

4. Read the printed contents page, then find its real pages

Scenario 3 is the most common case: a printed table of contents with no clickable links behind it. You’ll see a page titled “Contents” or “Table of contents,” a column of section titles, a column of page numbers, and often dot leaders connecting them. FIPS 202 is a perfect example. A human can parse it instantly. For a program, there are two separate steps involved, and it’s the second one that most people overlook.

4.1 Detecting and reading the contents page

The first step is locating the contents page. The real giveaway that distinguishes a contents page from regular text is the density of dot leaders — multiple lines that look like Some title .......... 42. Spotting the word “contents” boosts confidence but isn’t strictly necessary, and by itself it’s unreliable (a regular sentence might mention “table of contents”). The reader operates on line_df alone, making it independent of any particular PDF engine.

Input: line_df. Output: entries containing a title and a displayed_page, which is the page number exactly as it appears printed on the line.

import re
# "Introduction ......... 12"             "Introduction       12"
# \s* before the leader keeps the first dot out of the captured title
DOTTED   = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")

def extract_toc_from_contents(line_df):
    entries = []
    for page in find_contents_pages(line_df):    # pages dense in dot leaders
        for line in lines_of(line_df, page):
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,      # printed label
                                "level": infer_level(title)}) # "2.3.1" -> 3
    return entries
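The listing above relies on two helpers that the article does not show. The sketch below is one way they might look, under stated assumptions: lines are passed as a plain dict mapping page number to a list of strings (the real reader works on line_df), and the thresholds min_leader_lines and max_page are illustrative values, not the author’s.

```python
import re

NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")       # "2.3.1 " or "1. "
LEADER_LINE = re.compile(r"[.…]{4,}\s*\d{1,3}\s*$")    # "..... 42" at line end

def infer_level(title):
    """Depth from the section number: '2.3.1 Padding' -> 3, unnumbered -> 1."""
    m = NUMBERING.match(title)
    return m.group(1).count(".") + 1 if m else 1

def find_contents_pages(lines_by_page, min_leader_lines=5, max_page=20):
    """Early pages where at least min_leader_lines lines end in a dot leader."""
    return [
        page
        for page, lines in sorted(lines_by_page.items())
        if page <= max_page
        and sum(1 for ln in lines if LEADER_LINE.search(ln)) >= min_leader_lines
    ]


demo = {
    1: ["Cover"],
    7: [f"{i} Section {i} .......... {i * 2}" for i in range(1, 7)],
    9: ["Body text that mentions the table of contents."],
}
```

Restricting the search to early pages is a design choice in this sketch: an index at the back of a document can also be dense in dot leaders.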

4.2 The label is not the page

Here’s where it gets tricky. The contents page says Introduction .... 1. But page 1 of the file is the cover, not the introduction. A front matter section — cover, foreword, and the contents page itself — sits ahead of the main body, so the printed label and the actual physical page exist in different numbering systems. If you jump to the physical page matching the label, you’ll land several pages too early, every single time.

So a printed page number is just a label, stored in displayed_page. Converting it to the physical start_page is a separate step. The simple approach assumes a single constant offset: physical = displayed + shift. To determine the shift, grab a small sample of titles, test every reasonable offset, and keep the one where the most titles actually show up on their predicted page.

def infer_page_shift(line_df, entries, max_shift=40):
    """Best constant offset: physical_page = displayed_label + shift."""
    page_text = {p: text_of(line_df, p) for p in pages(line_df)}
    sample = [(e["displayed_page"], norm(e["title"])) for e in entries][:20]
    best_shift, best_score = 0, 0          # default to no shift when no title matches anywhere
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in page_text.get(label + shift, ""))
        if hits > best_score:              # most titles land where predicted
            best_score, best_shift = hits, shift
    return best_shift
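A toy run makes the scoring concrete. The sketch below is a self-contained variant written for illustration: page_text is an invented ten-page document, best_shift is a hypothetical name, and ties are broken toward the smallest absolute shift, which is an assumption of this example.

```python
def best_shift(page_text, entries, max_shift=40):
    """Score every candidate shift by how many titles appear on label + shift."""
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        scores[shift] = sum(
            1 for label, title in entries
            if title.lower() in page_text.get(label + shift, "").lower()
        )
    # Highest score wins; ties go to the smallest absolute shift.
    shift = max(scores, key=lambda s: (scores[s], -abs(s)))
    return shift, scores[shift]


# Invented document: three pages of front matter, body starts on page 4.
page_text = {
    1: "Cover page",
    2: "Foreword",
    3: "Contents Introduction 1 Scope 2 Notation 4 Functions 7",
    4: "Introduction. This document describes ...",
    5: "Scope. The following applies ...",
    6: "(continued)",
    7: "Notation. Symbols used ...",
    8: "(continued)",
    9: "(continued)",
    10: "Functions. Each function ...",
}
entries = [(1, "Introduction"), (2, "Scope"), (4, "Notation"), (7, "Functions")]

print(best_shift(page_text, entries))   # (3, 4)
```

The contents page contains every title, yet it cannot win: each label needs a different shift to land on page 3, so it contributes at most one hit to any single candidate, while the true shift collects all four.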

*Printed labels 1, 2, 4, 7 correspond to physical pages 4, 5, 7, 10 once the front-matter offset is determined. Image by author*

The same principle applies to a real document. FIPS 202 prints its contents page on physical pages 7 and 8, and its body numbering begins well after the front matter. Running the detection and alignment on it yields an inferred shift of +8: the introduction that the contents page labels as page 1 actually begins on physical page 9.

*Eight pages of front matter, so every printed label lands eight pages later in the file. Image by author*

Placed side by side with the page it read, the two columns tell the full story. The label column reproduces what the contents page prints; the page column shows where each section actually starts in the file.

*Left, the document’s own contents page; right, what the detector returns — label alongside physical page. Image by author*

A constant shift handles the typical case. When numbering restarts midway through (an appendix that resets to 1, or inserted plates), the offset isn’t constant anymore. The fallback is content matching: find each title’s real page by fuzzy-matching its text against the document body, ensuring the pages stay in monotonically non-decreasing order. align_toc_df tries the shift method first and falls back to content matching, so Scenario 3 delivers the same physical start_page downstream as Scenario 2.
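The content-matching fallback can be sketched with the standard library. This is an illustration under assumptions, not the align_toc_df implementation: match_by_content is a hypothetical name, difflib’s SequenceMatcher stands in for whatever fuzzy matcher the module uses, lines arrive as a dict of page number to list of strings, and the 0.85 threshold is arbitrary.

```python
from difflib import SequenceMatcher

def norm(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())

def match_by_content(entries, lines_by_page, min_ratio=0.85, skip_pages=()):
    """Give each title the first page, at or after the previous match,
    that holds a line resembling it. Unmatched titles get start_page None."""
    pages = sorted(p for p in lines_by_page if p not in skip_pages)
    resolved, floor = [], 1
    for entry in entries:
        target, found = norm(entry["title"]), None
        for page in pages:
            if page < floor:
                continue              # enforce non-decreasing page order
            if any(SequenceMatcher(None, target, norm(ln)).ratio() >= min_ratio
                   for ln in lines_by_page[page]):
                found = page
                break
        if found is not None:
            floor = found             # later titles cannot start earlier
        resolved.append({**entry, "start_page": found})
    return resolved


lines_by_page = {
    1: ["Cover"],
    2: ["Contents", "1 Introduction .... 1", "2 Glossary .... 3",
        "Appendix A Examples .... 1"],
    3: ["1 INTRODUCTION", "Some body text about the topic."],
    5: ["2 Glossary", "terms"],
    6: ["Appendix A  Examples", "more"],
}
entries = [
    {"title": "1 Introduction", "displayed_page": 1},
    {"title": "2 Glossary", "displayed_page": 3},
    {"title": "Appendix A Examples", "displayed_page": 1},   # numbering restarts
]
resolved = match_by_content(entries, lines_by_page, skip_pages=(2,))
```

Passing the contents pages as skip_pages keeps the matcher from resolving titles to the page that lists them, and the moving floor is what keeps the appendix from being placed before the sections that precede it.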

When the printed contents page is too irregular for these patterns — a two-column layout, titles that wrap across lines, or leaders rendered as uneven whitespace — the LLM extractor steps in with a typed schema, reading the first few pages and returning the same entry structure. This is a last-resort tool for this case, not the default approach, because a clean printed contents page is inexpensive to parse and the LLM is not. Even then, the LLM only reads the contents page; it never fabricates a structure for the document.

5. The LLM judges, it doesn’t detect

Both detection approaches are heuristics, and heuristics make mistakes: a link rectangle that swept up two titles, a contents line the patterns split wrong, a numbering that looks off. The reflex with an LLM is to hand it the entire document and ask for a TOC. That is the expensive, least auditable option. A better division of labour has the heuristic propose a TOC, and the LLM only checks whether it holds together.

from pydantic import BaseModel

class TocCoherenceVerdict(BaseModel):       # typed structured output
    is_coherent: bool
    issues: list[str]

SYSTEM = ("A heuristic already proposed this TOC. Do NOT detect structure. "
          "Judge only: is the numbering consistent (no unexplained skips), "
          "are the page numbers non-decreasing, does the hierarchy form a "
          "sensible tree?")

def check_toc_coherence(toc_df):
    view = "\n".join(f"[{r.start_page}] {'  ' * (r.l
