The standard RAG approach doesn’t fit every situation. Article 3 explains that there’s no single, universal RAG method—you need to choose the right one for your specific task. This article acts as a guide to help you make that choice.
Most teams follow the same standard process when building RAG systems: they break documents into chunks, create embeddings for each chunk, store them in a vector database, embed the user’s question, retrieve the most similar chunks using cosine similarity, and pass those results to a large language model. This is the classic RAG playbook. It’s what every tutorial teaches and every demo uses.
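The playbook above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the `embed` function is a bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the sample document, chunk size, and function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into fixed-size word windows (the naive chunking step)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented sample text, for illustration only.
document = (
    "This agreement takes effect on 1 March 2024 and runs for two years. "
    "The supplier shall deliver goods within thirty days of each order. "
    "Either party may terminate this contract with ninety days written notice."
)
question = "When does the agreement take effect?"
top = retrieve(question, chunk(document), k=1)

# The retrieved chunk is what would be passed to the LLM.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {question}"
```

Every step here is generic: nothing in the code knows what kind of document it is reading or who is asking, which is exactly the property the rest of this article questions.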
However, real-world problems are far more varied than this single approach suggests. Consider three very different scenarios.
Highly structured, high-volume documents. Think of insurance certificates, KYC forms, regulatory filings, or monthly brokerage statements. The same software generates the same layout for every document. A few hundred lines of regex can extract the necessary fields in microseconds. Using the classic RAG playbook here means paying an LLM to do something the document’s layout already provides for free.
This pattern appears across many industries: payroll stubs, bank statements, lab test reports, tax filings, compliance attestations, or supplier invoices from a single ERP system. Whenever one software application generates all the documents, the layout itself becomes a reliable contract.
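As a minimal sketch of what "the layout is the contract" looks like in code, the example below pulls three fields from a certificate with one regex per field. The field labels, the policy-number format, and the sample text are hypothetical; a real template would have its own labels and many more fields.

```python
import re
from typing import Optional

# Hypothetical layout: labels and value formats are invented for illustration.
CERTIFICATE_FIELDS = {
    "policy_number": re.compile(
        r"^Policy No\.:\s*(?P<value>[A-Z]{2}-\d{6})\s*$", re.MULTILINE
    ),
    "insured": re.compile(r"^Insured:\s*(?P<value>.+?)\s*$", re.MULTILINE),
    "effective_date": re.compile(
        r"^Effective Date:\s*(?P<value>\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
    ),
}


def extract_fields(text: str) -> dict[str, Optional[str]]:
    """Return one value per field, or None when the template did not match."""
    result: dict[str, Optional[str]] = {}
    for name, pattern in CERTIFICATE_FIELDS.items():
        match = pattern.search(text)
        result[name] = match.group("value") if match else None
    return result


sample = """CERTIFICATE OF INSURANCE
Policy No.: AB-123456
Insured: Example Logistics Ltd
Effective Date: 2024-03-01
"""
fields = extract_fields(sample)
```

A `None` value is useful in its own right: it signals that the document drifted from the template and should be routed elsewhere rather than silently guessed at.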
Detecting sarcasm in customer service transcripts. “Find every sarcastic remark in this month’s call recordings.” Standard sentiment analysis (identifying anger, frustration, joy) is largely handled by sentiment lexicons: words like unacceptable, ridiculous, or frustrated are clear indicators. Sarcasm is the classic exception. “Oh, fantastic service, only had to wait 45 minutes” would score as positive in any lexicon, and its embedding would cluster it with sincere praise because the surface words are nearly identical. The only reliable solution is an LLM that reads each call transcript in full and judges the gap between the literal words and the intended meaning.
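The failure is easy to reproduce with a toy lexicon scorer. The sketch below assumes a tiny hand-made lexicon (the weights and the positive words are invented; real sentiment lexicons are far larger) and simply sums word polarities.

```python
import re

# Toy lexicon, invented for illustration.
LEXICON = {
    "fantastic": 2, "great": 2, "helpful": 1,
    "unacceptable": -2, "ridiculous": -2, "frustrated": -1,
}


def lexicon_score(utterance: str) -> int:
    """Sum the polarity of every known word; unknown words count as zero."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return sum(LEXICON.get(word, 0) for word in words)


angry = lexicon_score("This is unacceptable, I am frustrated.")
sarcastic = lexicon_score("Oh, fantastic service, only had to wait 45 minutes")
```

The angry sentence scores negative, as expected. The sarcastic one scores positive, because the only word the lexicon recognizes is "fantastic" and nothing in the scorer can weigh it against the 45-minute wait.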
This pattern shows up in various functions: HR exit interviews seeking hidden frustration, internal chat archives scanned for cultural red flags before an M&A deal closes, earnings call transcripts analyzed for CFO hedging, or sales call recordings checked for promises not authorized in the contract. In all these cases, tone and intent have no direct textual anchor.
Engineering schematics (a completely different challenge). Drawings, slides where data is embedded in charts, technical specifications with integrated images. A text-only RAG system would return only the caption and miss the actual schematic. Vision models are needed here, and only here.
Similar cases include architectural blueprints, scanned handwritten records, slide decks where the key data is in the charts, lab notebook pages, or medical imaging reports. Whenever the meaning is embedded in the pixels themselves.
The classic RAG playbook is overkill for templated documents (where regex suffices), fundamentally mismatched for call transcripts (where no textual anchor exists), and blind to visual content in schematics (where vision is essential). It works well for a middle range of problems but is often treated as a universal solution. That middle range is real and is detailed in Section 3.3; the cost of misapplying it to other cases is what this article aims to prevent.
This article serves as your diagnostic tool. Follow these three steps, in order.
- Identify the two key axes: RAG problems aren’t monolithic. They exist on a spectrum defined by two axes: how structured your documents are, and how controlled your questions are. Each combination demands a different technical stack.
- Match techniques to each region: Each area of this spectrum has its own appropriate tools: regex, section-based retrieval, hybrid retrieval (combining lexical search with embedding similarity), vision models, or SQL-based aggregation. A third dimension—the agentic dimension, covered in section 2.4—sits on top of these and determines how much runtime control the LLM is given. The catalog later in the article maps each region to its corresponding technique zone.
- Pinpoint your own scenario: Where do your documents fall on the complexity axis? Where do your questions fall on the control axis? The intersection identifies a specific region and the techniques best suited for it.
You don’t need to build everything. You just need to identify where your problem fits, then focus on the relevant parts of this series. Most readers will skip about half of it.
A note before we dive into the technical details. Most enterprise RAG falls into two main categories: extracting fields from templated documents (the regex case mentioned earlier), or answering free-form questions on diverse documents like contracts and reports (where the rest of the series spends most of its time). Conversational transcripts represent a significant third category, common in customer service, HR, and compliance; sarcasm detection is the hardest challenge they present. Purely visual content (schematics, slide decks) and corpus-scale analytical questions (Part IV) arise less frequently. You may encounter one or two of these. The grid below lets you visually locate your scenario at a glance.
This diagnostic is part of a broader framework: Enterprise Document Intelligence Volume 1 builds enterprise RAG step by step, and the regions mapped in this article correspond to the series articles where each technique is developed in detail.
1. Two Axes: Document Complexity and Question Control
Every problem we’ll encounter in this series can be positioned along two axes:
- Document complexity: How consistent is the structure across your documents? Can a parser locate fields by position or heading, or do you need a model that can visually interpret the page?
- Question control: Who formulates the question? An engineer crafting a fixed prompt, or a user typing freely into a chat interface, possibly unsure of what to ask?
These two axes are largely independent. The one exception: a fixed-template document (Tier 1, described below) typically requires engineer-templated questions (Tier A), since the user never types a question directly. Outside that specific corner, any document tier can pair with any question tier.
1.1 Document Axis: From Fixed Template to Vision Model
Volume 1 focuses exclusively on PDF documents. Multi-format documents (Word, Excel, PowerPoint, email) are covered in Volume 2; everything below describes processing one PDF at a time.
Documents vary in structural redundancy: how much of their layout is shared across the entire corpus. Five tiers cover most enterprise scenarios.

**Tier 1: Fixed template:** Every document shares the same structure, with the same fields in the same locations, often generated by the same software: insurance certificates from a single broker, KYC forms, tax filings, or internal compliance attestations. The format is predictable enough that you can reliably target data points by their exact position on the page. *Technique: regex or coordinate-based extraction; no AI model needed.*
**Tier 2: Family of templates:** These documents follow a recognizable layout but with minor variations—like invoices from different suppliers, lease agreements from various landlords, or employment contracts across companies operating under the same legal system. *Technique: use one regex pattern per template, with a few-shot LLM as a backup when the format slightly deviates.*
**Tier 3: Heterogeneous structured:** Each document has its own unique structure (sections, headings, tables of contents), and these structures don’t repeat across files—think custom legal contracts, technical manuals from different vendors, or financial reports. *Technique: parse the document’s internal structure and retrieve content using its own table of contents.*
**Tier 4: Unstructured / OCR-processed:** This includes scanned PDFs, photos of paper documents, emails, or free-form notes—where the text exists but the layout is degraded or missing. *Technique: apply OCR with confidence scoring, then use hybrid retrieval (combining keyword search and semantic embeddings) over the noisy output.*
**Tier 5: Visually rich:** Documents where the *meaning* resides primarily in visuals—such as engineering diagrams, complex data tables rendered as images, slide decks with charts, or detailed schematics. A plain-text extraction will miss critical information. *Technique: use a vision-capable AI model on the page image, often paired with text-based RAG.*
The further down this spectrum you go, the higher the cost per document. The smartest approach is to push every task as high up the tiers as honestly possible. If your team labels the entire corpus “too complex for regex” without first checking for structural repetition, you’re defaulting to the most expensive solution unnecessarily.
1.2 Question axis: from fixed prompts to interactive chatbots
Teams often overlook this axis—but it’s crucial. Two questions may look identical in wording but demand entirely different technical stacks. What really matters is *who controls the question and how much flexibility they have*.

**Tier A: Engineer-templated:** The question is hardcoded into the system—for example, “Extract the effective date” or “What is the policy number?” The engineer designs, tunes, and validates the prompt across thousands of documents. End users may not even interact with it directly. *Technique: field extraction with structured output; no question interpretation required.*
**Tier B: User fills predefined slots:** The question uses a template where the user supplies specific values—for instance, “Show me the clause about {topic} in this contract.” The user selects the topic from a dropdown or enters a tag, but the overall query shape stays fixed. *Technique: section retrieval matched against a known category system.*
**Tier C: Free-form user query, answered immediately:** The user types anything they like, and the system responds in one shot—e.g., “Why does this contract differ from last year’s?” This is the classic “chat-with-your-document” scenario, where the system must interpret the query, fetch relevant context, and generate an answer. *Technique: single-document RAG with full question parsing.*
**Tier D: Free query with clarification capability:** Like Tier C, but the system can ask follow-up questions when the query is ambiguous—such as “Which section are you referring to?” or “Did you mean the sub-tenant or the main tenant?” This mirrors real-world chatbot behavior and greatly expands the range of usable questions. *Technique: question parsing combined with an interactive clarification loop.*
To illustrate clarification: suppose a user asks, “What is the deductible?” on an insurance policy that lists separate deductibles for home, auto, and travel coverage. A naive system might guess and return a confident but incorrect answer. One that *can* ask, “Which coverage—home, auto, or travel?” resolves the ambiguity upfront.
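A minimal sketch of that behavior, assuming the deductibles have already been extracted into a dictionary (the amounts and the `Answer` type are invented for the example, and the ambiguity check is a simple keyword match rather than real question parsing):

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    needs_clarification: bool = False


# Hypothetical extracted values for one policy; amounts are invented.
DEDUCTIBLES = {"home": "500 EUR", "auto": "300 EUR", "travel": "100 EUR"}


def answer_deductible(question: str, deductibles: dict[str, str]) -> Answer:
    """Answer directly when one coverage is named, otherwise ask which one."""
    lowered = question.lower()
    named = [coverage for coverage in deductibles if coverage in lowered]
    if len(named) == 1:
        coverage = named[0]
        return Answer(f"The {coverage} deductible is {deductibles[coverage]}.")
    options = ", ".join(deductibles)
    return Answer(f"Which coverage: {options}?", needs_clarification=True)


first = answer_deductible("What is the deductible?", DEDUCTIBLES)
second = answer_deductible("What is the auto deductible?", DEDUCTIBLES)
```

The design point is the `needs_clarification` flag: a Tier C system has no such return path and must always produce an answer, while a Tier D system can hand the ambiguity back to the user.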
Clarification shifts a key constraint upstream into parsing. To resolve user references like “page 3” or “the second appendix,” your parser must preserve metadata such as page numbers, section indices, and heading text for every text chunk. While page numbers seem trivial for a single document, they represent a foundational parsing decision that directly impacts how questions are handled. Article 5 explores this in depth.
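One way to picture the requirement is a chunk record that carries its location with it. The sketch below is an assumption about what such a record could look like, not the schema Article 5 uses; the field names and sample chunks are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    page: int            # 1-based page number in the source PDF
    section_index: str   # e.g. "2.1" or "B", as printed in the document
    heading: str         # heading text the chunk sits under


def chunks_on_page(chunks: list[Chunk], page: int) -> list[Chunk]:
    """Resolve a reference like 'page 3' to the chunks parsed from that page."""
    return [c for c in chunks if c.page == page]


def chunks_under_heading(chunks: list[Chunk], phrase: str) -> list[Chunk]:
    """Resolve a reference like 'the appendix' by matching heading text."""
    return [c for c in chunks if phrase.lower() in c.heading.lower()]


parsed = [
    Chunk("Rent is payable monthly.", page=3, section_index="2.1",
          heading="Payment terms"),
    Chunk("The inventory list follows.", page=12, section_index="B",
          heading="Appendix B: Inventory"),
]
```

If the parser discards page and heading information at ingestion, neither lookup can be written later, no matter how good the retrieval layer is.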
Note: Corpus volume (how many PDFs are in your corpus, and whether they are similar or varied) is a separate concern. It belongs to the data side, covered in the diagnostic in section 3.2 and detailed in Part IV (Articles 14–17). Mixing it with the question *control* axis would conflate two distinct issues, so we keep it out.
1.3 Mapping cases to technique zones
By combining the two axes—document structure and question type—every single-PDF RAG challenge falls into a specific zone on the grid. Each zone calls for a different technical approach. Most teams build for one or two zones and ignore the rest. The grid below is a practical thinking framework, not a rigid classification: real-world tasks often sit between adjacent zones, and boundaries are intentionally blurred.

The **top-left corner** (Tiers 1–2 for documents, A–B for questions) represents *deterministic territory*. Here, formats are fixed and questions are controlled. No LLM is needed for core extraction; at most, it serves as a fallback when templates drift slightly. This is where the insurance-broker case from the opening belongs. Most enterprise document workflows live here—and most are over-engineered. The broker’s annual €60,000 LLM setup could have been replaced with a few hundred lines of regex.
The **middle band** (document Tiers 2–4, question Tiers C–D) is the domain of *single-document RAG*. This is the “chat-with-your-PDF” demo every vendor loves to show. It’s genuinely challenging—and the bulk of this series focuses on it. When documents are structurally diverse and questions are open-ended, every stage matters: chunking (how you split the document), retrieval (finding the right pieces), reranking (refining the shortlist), and evaluation (verifying accuracy).
The **bottom row** (document Tier 5, all question tiers) is *vision territory*. Charts, schematics, and dense visual tables can’t be understood through text alone—no matter how smart the retrieval system. Vision models are essential here, and only here. Article 10 examines when the added cost of vision processing is justified—and when it isn’t.
Corpus-scale cases sit off the grid, since the grid is one PDF at a time. When the question targets many PDFs at once (“find every supplier contract with a liability cap below one million”), the diagnostic routes to Part IV (Articles 14-17): classification at ingestion, structured fields, SQL on the structured side, RAG on the residual unstructured questions.
The grid isn’t a recipe. It’s a sanity check. Locate your problem, look at the technique zone, and ask whether the system you’re building matches. If you’re building deeper than the case calls for, you’re paying for nothing. If you’re building shallower, you’ll discover the gap in production.
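The three named zones can be written down as a small lookup. This is a sketch of the description above, under two assumptions: combinations the text does not assign to a zone are returned as "between zones", and where the top-left and middle-band ranges both include Tier 2, the question tier decides.

```python
DOC_TIERS = (1, 2, 3, 4, 5)
QUESTION_TIERS = ("A", "B", "C", "D")


def technique_zone(doc_tier: int, question_tier: str) -> str:
    """Map a (document tier, question tier) pair to its technique zone."""
    if doc_tier not in DOC_TIERS or question_tier not in QUESTION_TIERS:
        raise ValueError(f"unknown case: ({doc_tier}, {question_tier})")
    if doc_tier == 5:
        return "vision"                    # bottom row, all question tiers
    if doc_tier <= 2 and question_tier in ("A", "B"):
        return "deterministic"             # top-left corner
    if 2 <= doc_tier <= 4 and question_tier in ("C", "D"):
        return "single-document RAG"       # middle band
    return "between zones"                 # not named in the text; judgment call


zones = {
    (d, q): technique_zone(d, q) for d in DOC_TIERS for q in QUESTION_TIERS
}
```

The "between zones" cells are where the blurred boundaries live: for example, an engineer-templated question on a heterogeneous contract (Tier 3, Tier A) may be served by section anchors or by retrieval, depending on how stable the headings are.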
2. The techniques per case, and what isn’t a technique
Once you’ve placed your problem on the grid, you know roughly which family of techniques applies. The rest of the series develops each technique in detail.

The deterministic family (regex, section anchors that locate a heading by name, coordinate-based extraction that pulls a field from a fixed bounding box on the page) doesn’t have its own article. It’s the baseline: every engineer reading this series should already know how to write a regex. The point of including it on the map is to remind you that it’s an option. When the structure of your input is fixed, it’s the option.
The single-document RAG family is what Parts II and III of the series are about. Layout-aware parsing (Article 5), question parsing and calibration (Article 6), retrieval as scope selection (Article 7), generation as controlled execution (Article 8), hybrid retrieval and TOC routing (Article 9), adaptive parsing including vision (Article 10), cross-references (Article 11), listing and synthesis (Article 12), composite pipelines with feedback loops (Article 13). Each of these is a technique you’ll reach for in the central band of the grid.
The corpus-scale family is Part IV. The corpus problem (Article 14), preparing a queryable corpus from a folder of PDFs (Article 15), the corpus ontology (Article 16), querying with SQL filter first and retrieval second (Article 17). These come in when you go from one PDF to a corpus of PDFs.
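To make "SQL filter first" concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table name, column names, and rows are hypothetical; it assumes the fields were extracted at ingestion and that a missing cap is stored as `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "contract_id TEXT PRIMARY KEY, contract_type TEXT, liability_cap INTEGER)"
)
# Invented rows standing in for fields extracted at ingestion time.
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [
        ("C-001", "supplier", 500_000),
        ("C-002", "supplier", 2_000_000),
        ("C-003", "lease", 250_000),
        ("C-004", "supplier", None),  # cap could not be extracted
    ],
)


def contracts_below_cap(
    db: sqlite3.Connection, contract_type: str, cap: int
) -> list[str]:
    """SQL filter step: return the ids of contracts under the given cap."""
    rows = db.execute(
        "SELECT contract_id FROM contracts "
        "WHERE contract_type = ? AND liability_cap < ? "
        "ORDER BY contract_id",
        (contract_type, cap),
    )
    return [row[0] for row in rows]


# The returned ids define the scope for any retrieval that follows.
scope = contracts_below_cap(conn, "supplier", 1_000_000)
```

Note that `C-004` drops out of the result because a `NULL` cap never satisfies the comparison. In this sketch, such rows are the residual set that would need a second look rather than a silent exclusion.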
If your problem is in the top-left corner of the grid, you can read through Article 5 (parsing) and then skip ahead to Article 15 (preparing a queryable corpus). If your problem sits in the middle band, you’ll need Parts II and III. If your problem is corpus-scale, you’ll need Part IV on top of the foundation. The map tells you which.
2.1 Pick the simplest technique that works
The instinct of every engineering team is to build the most powerful pipeline they can justify. That instinct is wrong here. The right instinct is to pick the least powerful technique that solves the actual problem. Three reasons:
- Cost: At two million docs a year, a regex on a VM is a rounding error; an LLM per document is sixty thousand euros.
- Latency: Microseconds vs seconds, the difference between “feels instant” and “feels like waiting”.
- Reliability: A regex either matches or it doesn’t and the engineer can read the rule; an LLM produces answers that are sometimes subtly wrong with failure modes harder to detect, which disqualifies it for audit-grade extraction.
Most production document workflows land on a hybrid: a deterministic core handling the bulk cleanly, with an LLM fallback for the cases where the format breaks. That hybrid is almost always the right shape, and almost never what teams build first.
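The hybrid shape can be sketched as a regex that runs first and a fallback that runs only when the regex fails. Everything specific here is an assumption: the policy-number format is invented, and `fake_llm` is an offline stand-in for a model call, not a real API.

```python
import re
from typing import Callable, Optional

# Hypothetical template pattern for one field.
POLICY_NUMBER = re.compile(r"Policy No\.:\s*([A-Z]{2}-\d{6})")


def extract_policy_number(
    text: str, llm_fallback: Callable[[str], Optional[str]]
) -> tuple[Optional[str], str]:
    """Try the deterministic rule first; call the fallback only on a miss.

    Returns the value together with the route that produced it, so the
    share of documents hitting the fallback can be monitored.
    """
    match = POLICY_NUMBER.search(text)
    if match:
        return match.group(1), "regex"
    return llm_fallback(text), "llm_fallback"


def fake_llm(text: str) -> Optional[str]:
    """Stand-in for a model call so the sketch runs offline."""
    found = re.search(r"[A-Z]{2}-\d{6}", text)
    return found.group(0) if found else None


on_template = extract_policy_number("Policy No.: AB-123456", fake_llm)
drifted = extract_policy_number("Policy number AB-123456 (new layout)", fake_llm)
```

Returning the route alongside the value is a deliberate choice in this sketch: a rising fallback rate is an early signal that the template has changed and the regex needs updating.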
2.2 Long context isn’t a way out
Every few months someone announces that “RAG is dead” because context windows just got bigger. The argument: dump the whole document in the prompt and let the model figure it out.
This works for one document and one user. It doesn’t work in production for four reasons:
- Wasteful: A typical question doesn’t need the whole document. The effective date of a contract sits on one page; sending the other thirty-nine pays for tokens that won’t be used.
- Misses information: Models tend to use what’s at the start and end of a long context more reliably than what’s in the middle (Liu et al., 2024), so the relevant page may be under-used even when it’s in the prompt.
- Doesn’t scale: Real use cases involve many documents. No context window will ever hold a corporate archive; at any meaningful scale you have to choose what to send, and that choice is retrieval.
- No grounded answer: Without explicit retrieval and citation, you can’t tell which part of the document the answer came from, you can’t verify it, you can’t audit it. For any enterprise use case where the answer needs to be traceable, that’s disqualifying.
Long contexts are useful as a tool, especially for single-document deep analysis. They’re not a substitute for retrieval. Anyone telling you otherwise is selling something.
2.3 Fancy techniques are usually keyword work in disguise
Techniques sold as “advanced” often turn out to be keyword work in another form, and often the wrong form. HyDE (Hypothetical Document Embeddings, Gao et al., 2022) is the clearest example. The protocol asks an LLM to write the hypothetical document that would answer the query, then retrieves against the embedding of that hypothetical. The pitch is that the hypothetical carries the vocabulary a real answer would use, widening the cosine margin.
The companion notebook tests this on the Attention paper: ask why multi-head attention, let HyDE generate its passage, compare against the actual vocabulary of section 3.2.2. The two lists overlap on exactly one phrase, the section title. HyDE writes ML-textbook vocabulary (semantic relationships, contextual dependencies, parallel processing, attention patterns); the paper writes operational vocabulary (attention layers, encoder-decoder attention, different positions, linear transformations).
HyDE understood the question. It never read the document. In enterprise settings the keywords exist somewhere on the page, and the domain expert who has read the page knows them. HyDE pays per query to invent vocabulary that often does not even land on the page. The expert dictionary (Article 6), a curated list of the corpus’s actual vocabulary built once with the domain expert, gets the same job done at a fraction of the cost and is reused across every question you ask from this point on.
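The comparison itself is a set intersection. The sketch below is not the companion notebook's code: it only replays the phrase lists quoted above, and it assumes the single shared phrase is the section title, "multi-head attention".

```python
# Phrases quoted in the text above; the shared section title is an assumption.
hyde_vocabulary = {
    "multi-head attention",
    "semantic relationships",
    "contextual dependencies",
    "parallel processing",
    "attention patterns",
}
paper_vocabulary = {
    "multi-head attention",
    "attention layers",
    "encoder-decoder attention",
    "different positions",
    "linear transformations",
}


def vocabulary_overlap(generated: set[str], actual: set[str]) -> set[str]:
    """Phrases the generated passage shares with the real section."""
    return generated & actual


shared = vocabulary_overlap(hyde_vocabulary, paper_vocabulary)
missed = paper_vocabulary - hyde_vocabulary
```

Here `shared` holds one phrase and `missed` holds the four operational terms the paper actually uses, which are the kind of entries an expert dictionary would contain.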
2.4 Allowing the LLM to choose the case
Every pairing of a document tier with a question tier forms an elementary case, and each one maps to a specific matching technique. In Volume 1, the engineer selects the case at compile time and deploys the corresponding technique. The dispatcher (Article 13) captures the team’s routing logic in Python; the LLM evaluates outputs within predefined loops; and every component remains fully auditable. This approach covers most enterprise RAG scenarios adequately.
A logical next step would be to let the LLM select the case dynamically at runtime—examining the incoming question, categorizing it into the appropriate case, and then choosing the right technique to apply. This is precisely what the industry in 2026 refers to as agentic RAG. Volume 3 (Agentic Bricks) adds this runtime-selection layer on top of the foundational bricks introduced in Volume 1. The key difference lies in when and by whom the decision is made, not in the bricks themselves: agentic systems still rely on the same parsing, retrieval, and generation building blocks that Volume 1 validates and tests.
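The difference between the two approaches can be shown by swapping only the selector. This is an illustrative sketch, not the dispatcher from Article 13: the bricks are stubs that return a label, and `runtime_selector` is a keyword rule standing in for an LLM classifier.

```python
from typing import Callable

Case = tuple[int, str]  # (document tier, question tier)
Selector = Callable[[str, str], Case]


def regex_extract(question: str, document: str) -> str:
    return "regex extraction"        # stub brick


def single_doc_rag(question: str, document: str) -> str:
    return "single-document RAG"     # stub brick


# The same bricks serve both approaches.
BRICKS: dict[Case, Callable[[str, str], str]] = {
    (1, "A"): regex_extract,
    (3, "C"): single_doc_rag,
}


def fixed_selector(case: Case) -> Selector:
    """Volume 1 style: the engineer picks the case ahead of time."""
    return lambda question, document: case


def runtime_selector(question: str, document: str) -> Case:
    """Stand-in for an LLM classifier that picks the case per question."""
    return (1, "A") if question.lower().startswith("extract") else (3, "C")


def run(question: str, document: str, select_case: Selector) -> str:
    """Select a case, then dispatch to the brick registered for it."""
    case = select_case(question, document)
    return BRICKS[case](question, document)


static = run("Extract the effective date", "doc", fixed_selector((1, "A")))
dynamic = run("Why does this contract differ from last year's?", "doc",
              runtime_selector)
```

Only the `select_case` argument changes between the two calls; `BRICKS` and `run` stay the same, which mirrors the point that the decision moves while the building blocks do not.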
3. Identifying your real-world case
3.1 Design the system around the expert you already have
The following diagnostic requires one input that most teams overlook: who will actually use this system?
In nearly every enterprise RAG scenario, the answer is the expert who already knows the documents inside and out. This isn’t an open-ended user typing arbitrary questions, or a casual browser searching a public archive. It’s the lawyer reviewing a contract, the underwriter verifying a quote, the compliance officer examining a clause. These are people who have worked with documents like these for years—who understand the terminology, know where a single term carries two different meanings, and recognize the failure patterns to watch for.
With that understood, the system’s purpose becomes straightforward: amplify that expert’s capabilities, not replace them. Embed their vocabulary, their disambiguation rules, their accumulated heuristics into the pipeline. Let the system handle the sheer volume of documents; let the expert remain the ultimate authority.
This framing matters before you even look at the grid, because it shapes which cases are realistic for your situation. A team that declares “anyone should be able to ask anything about the entire archive” is implicitly committing to the bottom-right case: open-ended questions against a mixed corpus—the most difficult scenario. A team that specifies “our underwriter verifies a known field on a known document type” is landing in the top-left, often solvable with regex alone.
This framing is rarely dictated by the nature of the documents or the questions themselves. It’s a deliberate choice the team makes, often by default, influenced by consumer chatbot assumptions they never consciously examined. Start by anchoring the system around the expert who is already there. Then locate your position on the grid that this answer points to.
3.2 The diagnostic questions
Before writing a single line of code, walk through these questions together with the domain experts, out loud, at a whiteboard.
About the documents: How similar are documents across the entire corpus? Are they native text or OCR-generated? How many PDFs are you dealing with, and do they follow a uniform structure or vary widely? (This is where corpus-wide considerations enter the picture—they lead into Part IV.) Is the content static or updated daily? Where on the document spectrum does your data fall?
About the questions: Who formulates them—an engineer during design, or an end user at runtime? Does the system operate in a single pass, or can it ask clarifying follow-up questions? Is the answer always found in a single document, or spread across multiple ones? What does no answer found mean: is it acceptable, or a critical failure? Where on the question spectrum does your scenario fall?
About the constraints: Must every answer be traceable back to its source? How rigorous does the precision need to be (best-effort, or audit-grade: every citation linked to a specific source line, every answer fully reproducible)? What is the cost budget per document? Sometimes the line between using regex and using an LLM is the line between a profitable system and an unprofitable one.
The answers to these questions guide you to a specific case on the grid. That case directs you to a technique zone, which in turn identifies the articles in this series that you’ll need most.
3.3 Common enterprise cases on the grid
A small number of patterns appear again and again in real-world projects. Most readers will see their own situation reflected in at least one of these.
Extracting fields from a fixed-template form. Consider insurance certificates from a single broker, KYC forms from one bank, or tax filings from one government body—the same software generates the same layout on every page. Case: doc tier 1, question A, top-left corner. Stack: regex applied to coordinate-addressable fields, with an LLM as a fallback for the occasional layout shift. A full-blown RAG pipeline is overkill here, and applying one is the most common mistake we encounter in practice.
Extracting fields across template variants. Think invoices from hundreds of suppliers, lease agreements across various landlords, or employment contracts across companies under the same legal framework—each document follows one of a few recognizable layouts. Case: doc tier 2, question A or B. Stack: a dedicated regex for each recognized template, supplemented by few-shot LLM extraction when a document doesn’t match any registered template. Classify first, then extract.
Q&A on a long, bespoke contract: Each contract has a different structure, sections differ, and ten-page glossaries don’t repeat. The user poses free-form questions about the specific contract in front of them. Case: doc tier 3, question C or D, middle band. Stack: full single-document RAG with table-of-contents routing, hybrid retrieval, and structured generation. This is the scenario where all four bricks in the series each play a meaningful role.
Reading a slide deck or a schematic: Think engineering drawings, financial presentations where the key data lives in charts, or technical specifications with embedded images—pure text parsing will miss the answer entirely. Case: doc tier 5, any question column, bottom row. Stack: a vision-capable model processing the page image, combined with text-based RAG for the prose surrounding the visuals.
Off the grid – corpus-level territory: “Find every supplier contract with a liability cap below one million” across hundreds or thousands of contracts. The single-PDF grid is no longer the right framework; the question targets the corpus as a whole, not an individual document. Stack: field extraction at ingestion time, structured fields stored in a database, SQL queries on the structured side, and RAG only as a fallback for remaining unstructured questions. Articles 14–17 (Part IV) cover this in depth.
Off the grid – no structure to rely on: a novel, an intent classification task, sarcasm detection. The document has no predictable structure, the vocabulary lacks distinctive terms, and the question demands understanding tone or intent rather than locating a specific passage. Stack: an LLM that reads the entire text paragraph by paragraph, deciding what to flag. This isn’t a RAG problem in the Volume 1 sense; section 2.4 offers a hint about where this kind of runtime decision-making fits (Volume 3).
If your situation doesn’t align neatly with any of these patterns, work through the diagnostic in section 3.2—the result will tell you which pattern above is the closest match.
4. Conclusion
Run the diagnostic on your own corpus before writing any code, ideally with the domain experts present.
The output is the list of articles in the rest of the series that are worth reading, along with the ones you can safely skip. The teams that successfully get RAG into production are the ones that figured out where their problem sits on the grid before writing any code. The teams that end up tweaking things for six months are typically the ones that started building before they fully understood the problem.
The next article kicks off Part II by focusing on the first foundational brick: document parsing. Whatever gets lost at that stage has no way of being recovered later, regardless of how sophisticated the retrieval pipeline is.
5. Sources and further reading
The two-axis grid maps each approach against two dimensions: document complexity and question control on a single PDF. The claim that long-context doesn’t replace retrieval — which the grid framework draws on — is backed by Liu et al. (Lost in the Middle, TACL 2024) and Lee et al. (long-context benchmark, 2024). The vision row ties back to Faysse et al. (ColPali, 2024). The HyDE demo draws on the technique from Gao et al. (HyDE, 2022). The agentic approach hinted at in section 2.4 — where the LLM chooses the strategy at runtime — is the path that Volume 3 explores, built on top of the groundwork established here.
Continuing along the same path as this article:
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024 (arXiv:2307.03172). Models consistently overlook information buried in the middle of long inputs. Backing for the view that long context isn’t a complete fix.
- Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). Real data showing where long-context can replace retrieval and where it hits its limits.
- Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Retrieval grounded in the visual appearance of the page itself. The key reference for the visual row of the grid.
- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), 2022 (arXiv:2212.10496). The hypothetical-document-embedding method put to the test in section 2.3.
A different perspective from a different setting:
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). The foundational paper on letting LLMs dynamically select tools at runtime. Volume 3 picks up this thread on top of the foundation Volume 1 lays down.
- Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023 (arXiv:2302.04761). Along the same lines as ReAct.
- Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, 2024 (arXiv:2312.10997). A broad survey of RAG that treats the field as a single unified paradigm with shared challenges (retriever quality, generator faithfulness).



