This continuation of the document parsing series within Enterprise Document Intelligence focuses on building an enterprise RAG system from four foundational components. It expands upon Article 5 (document parsing) by examining a single table: toc_df, which captures the document’s organizational structure. Article 5 populates this table using the PDF’s built-in outline (via PyMuPDF’s doc.get_toc) when available. This installment addresses the scenario where no built-in outline exists, showing how to reconstruct that structure from the visible content on the page.
Take NIST FIPS 202, the SHA-3 standard (a US Government publication in the public domain; refer to the NIST copyright notice), and go to page seven. You’ll find a well-formatted table of contents with section titles aligned to the left and corresponding page numbers on the right. Now open the identical file in any PDF viewer and check the bookmarks panel. It’s blank. The contents page exists only as text drawn on the page, not as structured data the software can interpret. The document author created a proper table of contents, but the file doesn’t expose it in a machine-readable way.
Article 5 (document parsing) and Article 5B (the relational data model) relied on doc.get_toc(), the PDF’s built-in outline, to populate toc_df. When present, it’s precise. But it frequently isn’t available. Many real-world documents (academic papers exported directly from LaTeX, contracts converted to PDF, government standards) include a printed table of contents yet lack an internal outline. In those instances, toc_df comes back empty, even though the structure is printed right there on the page, as it is on page seven of FIPS 202.
That structural information isn’t merely decorative. Retrieval operates on a per-section basis (Article 7). The chunker splits along heading boundaries (Article 5B). Summarization processes the document section by section. Each of these stages depends on toc_df. When the table is empty, retrieval reverts to scanning every page, the chunker divides content based on arbitrary page breaks, and the resulting answers lose the document’s inherent organization. So the question addressed here is both focused and practical: when a file contains no outline but does include a printed contents page, how do you convert that page back into a functional toc_df?
One clarification right away, since the distinction is easily blurred. This discussion applies to documents that already have a contents page. A document with no contents page whatsoever (a paper that begins directly with “1. Introduction”, a brief five-page memo, or an export that stripped all headings entirely) presents a fundamentally different challenge. Recovering structure from the body of an unstructured document is a summarization task, a separate objective that builds the map from individual chunks instead of reading it directly from a page. Our focus here is strictly on interpreting an existing contents page.
1. Two phases: extract the entries, then locate their actual pages
It’s useful to distinguish between two distinct pieces of information that a contents page provides. First, it gives you an ordered list of sections with titles and their hierarchical relationships: what the document covers and in what sequence. Second, it provides a mapping from each section to its physical starting location within the file. The built-in outline delivers both automatically. Reading a printed contents page, however, gives you the first directly, while the second arrives only as printed references, which aren’t physical page numbers. These two phases have different potential points of failure, so this article treats them separately: first extract the entries, then align them to their physical locations.
Input: a PDF where doc.get_toc() returns nothing but a contents page is printed within.
Output: a toc_df matching the schema defined in Article 5B (level, title, start_page, end_page, breadcrumb), ensuring all downstream processes continue functioning as expected.
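Both contents-page readers produce entries with only a level, a title, and a start page, so the remaining two columns of the schema have to be derived. The sketch below is one way to do that on plain dicts. The function name `finalize_toc` is hypothetical, and the end-page convention (a section ends on the page before the next section at the same or a shallower level starts) is an assumption that Article 5B may define differently.

```python
def finalize_toc(entries, last_page):
    """Derive end_page and breadcrumb for entries listed in reading order.

    Each entry is a dict with 'level' (1 = top), 'title', 'start_page' (1-based).
    `last_page` is the physical page count of the document.
    """
    rows = []
    trail = []  # titles of the currently open ancestors, one per level
    for i, entry in enumerate(entries):
        # Assumed convention: a section runs until the page before the next
        # section at the same or a shallower level, never before its own start.
        end = last_page
        for nxt in entries[i + 1:]:
            if nxt["level"] <= entry["level"]:
                end = max(entry["start_page"], nxt["start_page"] - 1)
                break
        # Keep the ancestors above this level, then append this title.
        trail = trail[: entry["level"] - 1] + [entry["title"]]
        rows.append({**entry, "end_page": end, "breadcrumb": " > ".join(trail)})
    return rows


rows = finalize_toc(
    [{"level": 1, "title": "A", "start_page": 9},
     {"level": 2, "title": "A.1", "start_page": 10},
     {"level": 1, "title": "B", "start_page": 12}],
    last_page=20,
)
```

The resulting list of dicts has exactly the five schema columns, so it could be handed to a DataFrame constructor unchanged.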
Contents pages come in two varieties, each requiring different effort to interpret.
2. Three scenarios, ranked by complexity

Each scenario involves both a detection phase and an extraction phase, falling through to the next when detection fails or yields insufficient results.
- Scenario 1: native outline. Already handled by the build_toc_df function from Article 5. Free, precise, and hierarchically complete. When this succeeds, no additional work is needed. We mention it here only to establish the cost baseline.
- Scenario 2: contents page with hyperlinks. No outline exists, but an early page lists section titles as internal hyperlinks pointing to locations within the document. Because each link already specifies its physical target page, alignment becomes unnecessary.
- Scenario 3: contents page without hyperlinks. A page formatted like a traditional printed table of contents (titles, dot leaders, right-aligned page numbers) but containing no links. The page numbers shown are the document’s own logical numbering rather than physical page indices, so this scenario requires the alignment phase.
All of this logic resides in a dedicated module, kept separate from the native-outline path to maintain the clarity of Article 5. The entry point is reconstruct_toc_df.
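The fall-through between scenarios can be pictured as a short loop over readers ordered from cheapest to most expensive. This is a minimal sketch, not the module's actual entry point: the name `reconstruct_toc`, the `(name, reader)` pairing, and the `min_entries=3` threshold for "insufficient results" are all assumptions made for illustration.

```python
def reconstruct_toc(readers, min_entries=3):
    """Try each reader in order and return the first usable result.

    `readers` is an ordered list of (name, zero-argument callable) pairs,
    cheapest first. Each callable returns a list of entries, possibly empty.
    """
    for name, reader in readers:
        entries = reader()
        if len(entries) >= min_entries:  # detection succeeded with enough entries
            return name, entries
    return None, []  # nothing recoverable; the caller decides what happens next


# Stand-in readers: the first two find too little, the third succeeds.
source, entries = reconstruct_toc([
    ("native_outline", lambda: []),
    ("hyperlinks", lambda: [{"title": "Only one link"}]),
    ("printed_contents", lambda: [{"title": t} for t in ("A", "B", "C")]),
])
```

Returning the name of the reader that succeeded makes it easy to log which scenario a given document fell into.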
3. Follow the hyperlinks
Scenario 2 is the favorable case. Some documents lack an outline but do include a clickable contents page. The NIST Cybersecurity Framework is one example: page two presents every section as a hyperlink navigating into the document. PyMuPDF exposes these links on a per-page basis, and each internal link carries its target page directly.
Input: the PDF (links are not captured in line_df, so this reader opens the file directly).
Output: entries with a title and the physical target page, already resolved.
Detection relies on a density check: a page containing five or more internal links is treated as a navigation page rather than a body page with an occasional footnote link. Extraction maps each link’s clickable rectangle back to the text it covers, then strips the decorative leaders and the trailing page label.
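The leader-stripping step is easy to get subtly wrong, so here is a small sketch of it in isolation. The function name `clean_link_label` and the regular expression are assumptions, not the module's own helper. The label is only removed when it follows a run of leaders, so a title that legitimately ends in a number is left alone.

```python
import re

# A run of three or more leader characters (dots or ellipsis characters,
# optionally spaced), followed by an optional arabic or roman page label
# at the very end of the text.
LEADER_AND_LABEL = re.compile(
    r"(?:\s*[.…]){3,}\s*(?:\d{1,4}|[ivxlcdm]+)?\s*$", re.IGNORECASE
)


def clean_link_label(text):
    """Collapse whitespace, then drop trailing leaders and the page label."""
    text = " ".join(text.split())  # also joins text wrapped across lines
    return LEADER_AND_LABEL.sub("", text).strip()


print(clean_link_label("2.1  Framework Core ........ 7"))
print(clean_link_label("Annex 2"))
```

When the link rectangle covers only the title, there is nothing to strip and the text passes through unchanged.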
import fitz  # PyMuPDF

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is identified as the page with the highest number of internal links."""
    # clean() and text_under_rect() are helpers assumed to be defined elsewhere in the module.
    doc = fitz.open(pdf_path)
    best = []
    for page in doc:
        entries = []
        for link in page.get_links():
            if link["kind"] != fitz.LINK_GOTO:  # only internal navigation links
                continue
            label = clean(text_under_rect(page, link["from"]))
            if label:
                entries.append({"title": label,
                                "start_page": link["page"] + 1,  # destination page, 0-based -> 1-based
                                "level": 1})
        # Keep the page with the most internal links, provided it passes the density check.
        if len(entries) >= min_links and len(entries) > len(best):
            best = entries
    doc.close()
    return best

4. Read the printed contents page, then find its real pages
Scenario 3 is the most common one: a printed table of contents with no clickable links behind it. You’ll see a page titled “Contents” or “Table of contents,” a column of section titles, a column of page numbers, and often dot leaders connecting them. FIPS 202 is a perfect example. A human can parse it instantly. For a program, there are two separate steps involved, and it’s the second one that most people overlook.
4.1 Detecting and reading the contents page
The first step is locating the contents page. The real giveaway that distinguishes a contents page from regular text is the density of dot leaders — multiple lines that look like Some title .......... 42. Spotting the word “contents” boosts confidence but isn’t strictly necessary, and by itself it’s unreliable (a regular sentence might mention “table of contents”). The reader operates on line_df alone, making it independent of any particular PDF engine.
Input: line_df.
Output: entries containing a title and a displayed_page, which is the page number exactly as it appears printed on the line.
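The reader relies on two small helpers: one that finds pages dense in dot leaders, and one that turns a numbering prefix into a depth. The sketch below shows how they might look on a plain mapping of page number to lines instead of line_df. The names `pages_dense_in_leaders` and `level_from_numbering` are hypothetical stand-ins, and both thresholds are assumptions: five leader lines by default, relaxed to three when the word “contents” also appears on the page.

```python
import re

# A line ending in a run of leaders and a short page number.
LEADER_LINE = re.compile(r"\S\s*(?:[.…]\s?){4,}\s*\d{1,3}\s*$")
# A leading section number such as "2", "2.3" or "2.3.1", optionally with a final dot.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?:\s|$)")


def pages_dense_in_leaders(lines_by_page, min_lines=5, min_lines_with_keyword=3):
    """Return the pages whose count of leader lines passes the threshold."""
    found = []
    for page, lines in sorted(lines_by_page.items()):
        leader_lines = sum(1 for line in lines if LEADER_LINE.search(line))
        has_keyword = any("contents" in line.lower() for line in lines)
        # The keyword only lowers the bar; it never qualifies a page by itself.
        needed = min_lines_with_keyword if has_keyword else min_lines
        if leader_lines >= needed:
            found.append(page)
    return found


def level_from_numbering(title):
    """'2.3.1 Padding' -> 3; titles without a numeric prefix default to level 1."""
    m = NUMBERING.match(title)
    return m.group(1).count(".") + 1 if m else 1
```

Unnumbered titles such as “Appendix A” fall back to level 1 here, which flattens them into top-level sections.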
import re

# "Introduction ......... 12"        "Introduction        12"
# The \s* before the leaders keeps a space between title and dots out of the title.
DOTTED = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")   # title, leaders, label
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")           # title, wide gap, label

def extract_toc_from_contents(line_df):
    entries = []
    for page in find_contents_pages(line_df):  # pages dense in dot leaders
        for line in lines_of(line_df, page):
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,        # printed label
                                "level": infer_level(title)})   # "2.3.1" -> 3
    return entries

4.2 The label is not the page
Here’s where it gets tricky. The contents page says Introduction .... 1. But page 1 of the file is the cover, not the introduction. A front matter section — cover, foreword, and the contents page itself — sits ahead of the main body, so the printed label and the actual physical page exist in different numbering systems. If you jump to the physical page matching the label, you’ll land several pages too early, every single time.
So a printed page number is just a label, stored in displayed_page. Converting it to the physical start_page is a separate step. The simple approach assumes a single constant offset: physical = displayed + shift. To determine the shift, grab a small sample of titles, test every reasonable offset, and keep the one where the most titles actually show up on their predicted page.
def infer_page_shift(line_df, entries, max_shift=40):
    """Best constant offset: physical_page = displayed_label + shift."""
    page_text = {p: text_of(line_df, p) for p in pages(line_df)}
    sample = [(e["displayed_page"], norm(e["title"])) for e in entries][:20]
    best_shift, best_score = 0, 0  # if no title matches anywhere, keep shift 0
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in page_text.get(label + shift, ""))
        if hits > best_score:  # most titles land where predicted
            best_score, best_shift = hits, shift
    return best_shift
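To watch the scoring work without a PDF, here is a toy version of the same idea on a plain dict of page texts. The function `best_shift` is a hypothetical stand-in that lowercases both sides and also returns the hit count, which a caller could use to tell a confident shift from a guess. Note that the contents page itself contains every title, so some wrong shifts still earn partial credit; only the correct one matches all titles.

```python
def best_shift(page_text, entries, max_shift=40):
    """Return (shift, hits) for the offset under which most titles are found."""
    text = {p: t.lower() for p, t in page_text.items()}
    sample = [(e["displayed_page"], e["title"].lower()) for e in entries][:20]
    best, best_hits = 0, 0
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in text.get(label + shift, ""))
        if hits > best_hits:
            best, best_hits = shift, hits
    return best, best_hits


# Synthetic five-page document: cover, contents page, then the body.
toy_pages = {
    1: "Cover",
    2: "Contents Introduction 1 Methods 3",
    3: "1 Introduction ...",
    4: "...",
    5: "2 Methods ...",
}
toy_entries = [
    {"title": "Introduction", "displayed_page": 1},
    {"title": "Methods", "displayed_page": 3},
]
print(best_shift(toy_pages, toy_entries))  # -> (2, 2)
```

A shift of -1 scores one hit here, because label 3 then points at the contents page, where “Methods” is printed. A shift of +2 places both titles on their body pages and wins.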
The same principle applies to a real document. FIPS 202 prints its contents page on physical pages 7 and 8, and its body numbering begins well after the front matter. Running the detection and alignment on it yields an inferred shift of +8: the introduction that the contents page labels as page 1 actually begins on physical page 9.

Placed side by side with the page it read, the two columns tell the full story. The label column reproduces what the contents page prints; the page column shows where each section actually starts in the file.

A constant shift handles the typical case. When numbering restarts midway through (an appendix that resets to 1, or inserted plates), the offset is no longer constant. The fallback is content matching: find each title’s real page by fuzzy-matching its text against the document body, while keeping the assigned pages in non-decreasing order. align_toc_df tries the shift method first and falls back to content matching, so Scenario 3 delivers the same physical start_page downstream as Scenario 2.
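The content-matching fallback can be sketched with the standard library’s difflib. This is an illustration under stated assumptions, not the body of align_toc_df: the name `align_by_content`, the 0.85 similarity threshold, and the `skip_pages` parameter (for excluding the contents pages themselves) are all hypothetical. Ordering is enforced by never searching before the page assigned to the previous entry.

```python
from difflib import SequenceMatcher


def _norm(s):
    return " ".join(s.lower().split())


def align_by_content(entries, lines_by_page, skip_pages=(), threshold=0.85):
    """Give each entry the first page, at or after the previous entry's page,
    that holds a line closely resembling its title. Unmatched entries get None."""
    aligned, floor = [], 1
    candidates = sorted(p for p in lines_by_page if p not in skip_pages)
    for entry in entries:
        target = _norm(entry["title"])
        found = None
        for page in candidates:
            if page < floor:  # keeps pages monotonically non-decreasing
                continue
            if any(SequenceMatcher(None, target, _norm(line)).ratio() >= threshold
                   for line in lines_by_page[page]):
                found = page
                break
        if found is not None:
            floor = found
        aligned.append({**entry, "start_page": found})
    return aligned


# Two entries share the printed label 1 because the appendix restarts numbering.
doc_lines = {
    1: ["Contents", "Introduction ........ 1", "Appendix A ........ 1"],
    2: ["1 Introduction", "body text"],
    3: ["more body text"],
    4: ["Appendix A", "restarted numbering"],
}
result = align_by_content(
    [{"title": "Introduction", "displayed_page": 1},
     {"title": "Appendix A", "displayed_page": 1}],
    doc_lines,
    skip_pages={1},
)
```

Entries that cannot be matched keep a start_page of None, so the caller can decide whether to interpolate, drop them, or escalate.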
When the printed contents page is too irregular for these patterns — a two-column layout, titles that wrap across lines, or leaders rendered as uneven whitespace — the LLM extractor steps in with a typed schema, reading the first few pages and returning the same entry structure. This is a last-resort tool for this case, not the default approach, because a clean printed contents page is inexpensive to parse and the LLM is not. Even then, the LLM only reads the contents page; it never fabricates a structure for the document.
5. The LLM judges, it doesn’t detect
Both detection approaches are heuristics, and heuristics make mistakes: a link rectangle that swept up two titles, a contents line the patterns split wrong, a numbering that looks off. The reflex with an LLM is to hand it the entire document and ask for a TOC. That is the expensive, least auditable option. A better division of labour has the heuristic propose a TOC, and the LLM only checks whether it holds together.
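The division of labour can be sketched as a tiny control flow in which the judge can veto the heuristic’s proposal but never rewrite it. Everything here is hypothetical: the names `Verdict` and `accept_or_flag`, and especially `pages_non_decreasing`, which is a rule-based stand-in used only to exercise the flow where an LLM call would normally sit. What happens after a veto is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    is_coherent: bool
    issues: list = field(default_factory=list)


def accept_or_flag(proposed, judge):
    """Return (toc, []) if the judge accepts, or (None, issues) if it vetoes."""
    verdict = judge(proposed)
    if verdict.is_coherent:
        return proposed, []
    return None, verdict.issues


def pages_non_decreasing(proposed):
    """Stand-in judge: flags any entry that starts before its predecessor."""
    issues = [f"'{b['title']}' starts before '{a['title']}'"
              for a, b in zip(proposed, proposed[1:])
              if b["start_page"] < a["start_page"]]
    return Verdict(is_coherent=not issues, issues=issues)


good = [{"title": "A", "start_page": 3}, {"title": "B", "start_page": 5}]
bad = [{"title": "A", "start_page": 5}, {"title": "B", "start_page": 3}]
print(accept_or_flag(good, pages_non_decreasing))
print(accept_or_flag(bad, pages_non_decreasing))
```

Because the judge only returns a verdict and a list of issues, every accepted TOC is still exactly what the heuristic produced, which keeps the result auditable.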
from pydantic import BaseModel

class TocCoherenceVerdict(BaseModel):  # typed structured output
    is_coherent: bool
    issues: list[str]

SYSTEM = ("A heuristic already proposed this TOC. Do NOT detect structure. "
          "Judge only: is the numbering consistent (no unexplained skips), "
          "are the page numbers non-decreasing, does the hierarchy form a "
          "sensible tree?")

def check_toc_coherence(toc_df):
    view = "\n".join(f"[{r.start_page}] {' ' * (r.l



