A team spent six months of dedicated work refining its RAG pipeline.
- They executed five Optuna optimization runs.
- They implemented a custom reranker.
- They fine-tuned an embedding model using their proprietary data.
Production accuracy remained unchanged. Pilot users kept flagging the same incorrect answers. After half a year, the root cause turned out to be in the parser.
The team wasn’t just stuck; they were looking in the wrong direction. RAG isn’t machine learning, and applying ML tools addresses the wrong challenge. This represents the most costly misunderstanding in enterprise RAG initiatives today. It eats up months of effort, assigns people to the wrong tasks, and gradually undermines confidence in the solution.
RAG resembles machine learning closely enough that ML toolkits seem like the logical choice. The techniques (hyperparameter tuning, evaluation datasets, explainability methods) aren’t inherently flawed on their own. They’re borrowed from an incorrect domain. Strategies that succeed for training models fall short when constructing search systems.
This isn’t to say ML is flawed. The embedding model driving vector search is indeed a deep learning model, but you use it rather than train it. The real issue is that the framework you’re constructing around it isn’t a model, and approaching it as one squanders effort, selects incorrect metrics, recruits unsuitable personnel, and obscures the actual breakdown points.
The assertion that “RAG isn’t ML” forms part of Enterprise Document Intelligence Volume 1, which constructs enterprise RAG systematically. The four core components (parsing, question parsing, retrieval, generation) constitute the engineering toolkit referenced in this article.
1. Two distinct challenges
Machine learning tackles problems where the correct response is uncertain and requires prediction. Will this client leave? Is this transaction fraudulent? Does this image show a cat? The answer isn’t known beforehand. That’s the reason for training a model. The model learns from labeled samples, generalizes to unseen cases, and generates predictions. Success is evaluated collectively, across thousands of test instances, since single predictions might prove incorrect while the model remains practically valuable.
RAG addresses a separate challenge. The response to “what’s the effective date of this contract?” is already specified on page one of the document, or it’s absent entirely. There’s nothing to forecast. The system either locates the answer within the document and communicates it accurately, or it doesn’t locate it and should acknowledge failure. Success is pass/fail at the question level (found it or not) even when you assess overall rates across numerous questions.
These distinctions are tangible:
- In ML, “the model erred on 8% of cases” represents a system characteristic. You build redundancy, implement verification steps, assign human reviewers for uncertain cases. In RAG, “the system provided an incorrect answer 8% of the time” is a defect. Every instance within that 8% stems from a distinct reason: an irrelevant passage was retrieved, the correct passage was found but the model paraphrased it poorly, or the answer was absent from the corpus and the system fabricated one. These aren’t statistical variations to average out. Each failure can be specifically corrected.
- In ML, you usually can’t determine why the model mishandled a specific case. That’s precisely why explainability constitutes an active research domain. In RAG, you can always identify the cause. The retrieval process records which passages were returned. The generator had access to exactly those passages. If the response is incorrect, you trace back through the pipeline and pinpoint the failure. Nothing remains concealed.
- In ML, the model improves through additional training data. In RAG, the system improves through better indexing, more careful parsing, more targeted retrieval, and clearer prompting. None of these activities constitutes training. They’re engineering tasks.
That distinction determines which solutions you reach for when problems arise.
The failures documented in Article 2 fall precisely in this category: negation handling, precise identifiers, internal acronyms, signal degradation in lengthy contexts, and topical proximity overshadowing actual answers. None of these issues improve when you exchange embedding models or search through chunk size variations. They aren’t defects that a model can learn to avoid, because no labeled indicator exists stating “this is the correct response” for the model to learn from. Each fix requires structural changes (question analysis, domain-specific keywords, retrieval that understands document architecture), and the following segments examine the three ML habits that select the wrong solution.
2. Three arguments that miss the mark
Three ML approaches routinely get adopted into RAG initiatives by default: hyperparameter optimization, evaluation datasets with train/test divisions, and feature-attribution explainability. Each makes sense within ML. Each falls short in this context.
2.1 The hyperparameter argument
The typical reasoning follows this pattern: chunk size, overlap, top-k, similarity threshold. These behave like hyperparameters, so they should be optimized similarly to ML models, using frameworks such as Optuna or Ray Tune. Run simulations, visualize the results, select the strongest setup.
In these setups, top_k defines how many passages the retriever keeps, while similarity_threshold sets the minimum cosine similarity score a passage needs to be included. The following code declares all four as tunable parameters:
```python
# Typical team implementation (and why it's the wrong focus)
import optuna


def objective(trial):
    chunk_size = trial.suggest_int("chunk_size", 100, 2000)
    chunk_overlap = trial.suggest_int("chunk_overlap", 0, 200)
    top_k = trial.suggest_int("top_k", 1, 20)
    threshold = trial.suggest_float("threshold", 0.5, 0.95)
    # run_rag_pipeline_and_score stands for the team's own end-to-end scoring function (not shown)
    accuracy = run_rag_pipeline_and_score(
        chunk_size, chunk_overlap, top_k, threshold
    )
    return accuracy


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)  # two weeks of compute later...
```

There’s some validity to this perspective. These parameters do influence retrieval effectiveness, and they deserve tuning. The issue begins with the term “hyperparameter,” which carries implicit assumptions.
Within machine learning, a hyperparameter governs model training: learning rate, regularization intensity, layer count. The model itself evolves during training; the hyperparameter influences that evolution. In RAG, no learning occurs. The chunk size doesn’t direct learning behavior. It governs a function that splits text identically each iteration, independent of prior inputs.
What appears to be hyperparameter tuning is really configuration selection, comparable to adjusting a search engine setup. The competence needed to optimize it isn’t statistical optimization. It’s comprehension of your document organization and user question patterns. A chunk size of 512 tokens might excel with dense academic texts but fail completely on insurance policies where a single clause extends across 800 tokens, and splitting it destroys the conditional logic essential to interpretation. No systematic search will reveal this. You must study your actual content.
This explains why teams that exhaustively test different chunk sizes usually land on a “best” value that shows slight improvement on test data but performs no differently in production. The perceived optimum was a quirk of the test set, not a real system-level gain. They fine-tuned a figure without actually addressing the underlying issue.
Common pitfall: A team spends two weeks running Optuna over chunk_size, top_k, and similarity_threshold, eventually settling on chunk_size=487 without understanding the rationale. If asked “why 487?”, the only response is “Optuna picked it.” That reasoning falls apart during a genuine production issue and doesn’t hold up when the document mix changes. A chunk size of 500 chosen because it roughly matches paragraph length in the current corpus is far easier to defend than 487 chosen because a parameter sweep happened to land there.
The real task isn’t adjusting numeric parameters. It’s making structural choices about chunking. By section? By paragraph? By table of contents entries? By question category, using different chunking approaches for quick lookups versus lengthy clauses? The answer comes from examining documents and questions, not from optimization curves.
There’s a more fundamental reason chunk size defies optimization: by nature, no single chunk size can handle every question. Consider two questions about the same insurance policy:
- “What is the effective date?” The answer is a single line on page one. It needs a chunk small enough to isolate that line precisely.
- “What are the policy exclusions?” The answer could span one page or three pages, depending on how the insurer drafted it. It needs a chunk large enough to encompass the whole section.
No single value satisfies both. A 200-token chunk size fragments the exclusions into disjointed pieces. A 2000-token chunk size drowns the effective date in irrelevant surrounding text.
Searching for “the ideal chunk size” isn’t really a tuning problem. The premise is flawed: no one number can serve a range of questions whose answers vary in length.
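The trade-off can be reproduced on a toy document. The sketch below is only an illustration under loud assumptions: words stand in for tokens, the "policy" is synthetic (a date line, filler, then a 600-word exclusions section), and the names chunk_by_words and summarize are made up for this example.

```python
def chunk_by_words(words, size):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[start:start + size] for start in range(0, len(words), size)]


# Toy policy (made up for illustration): a date line, filler text, then exclusions.
date_line = ["EFFECTIVE_DATE", "2026-01-01"]
filler = ["filler"] * 198
exclusions = ["EXCLUSION"] * 600
policy = date_line + filler + exclusions  # 800 words in total


def summarize(size):
    """Return (words in the chunk holding the date, chunks touching the exclusions)."""
    chunks = chunk_by_words(policy, size)
    date_chunk = next(chunk for chunk in chunks if "EFFECTIVE_DATE" in chunk)
    exclusion_chunks = [chunk for chunk in chunks if "EXCLUSION" in chunk]
    return len(date_chunk), len(exclusion_chunks)


for size in (200, 2000):
    print(size, summarize(size))
```

With 200-word chunks the exclusions end up spread over three separate chunks. With 2000-word chunks the date shares a single 800-word chunk with everything else. Neither setting serves both questions.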
In theory, you could make chunk size adapt to each question by training a lightweight model that predicts the right chunker based on question features: classify the intent, estimate the expected answer length, and output a strategy. That would be a legitimate application of machine learning to a problem where something is genuinely being learned.
But that’s unnecessary. You can just write the rule outright. Looking at a question, you can see whether it requests a date, a section, or a comparison. So can a subject-matter expert. So can a few lines of Python with keyword-based conditions. The deeper reason RAG isn’t a machine learning problem is that, for most decisions within the system, you already know the answer, or someone on your team does. Machine learning is for problems where no one knows the answer ahead of time.
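As a rough sketch of what those few lines might look like: the keyword lists below are hypothetical placeholders (a real list would come from a subject-matter expert), and the intent names simply mirror the three question types discussed in this section.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
INTENT_RULES = [
    ("comparison", re.compile(r"\b(compare|versus|vs|difference between)\b", re.I)),
    ("section_retrieval", re.compile(r"\b(exclusions?|conditions|obligations|what are)\b", re.I)),
    ("point_lookup", re.compile(r"\b(date|amount|deductible|how much|how many|when|who)\b", re.I)),
]


def classify_intent(question: str) -> str:
    """Return the first intent whose keyword pattern matches the question."""
    for intent, pattern in INTENT_RULES:
        if pattern.search(question):
            return intent
    return "unknown"  # assumption: the caller decides how to handle unmatched questions


print(classify_intent("What is the effective date?"))
print(classify_intent("What are the policy exclusions?"))
print(classify_intent("Compare clauses A and B"))
```

Every rule is readable, and when a question is misrouted the fix is a one-line change to a keyword list rather than another training run.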
The right path is to abandon the quest for one universal chunk size and instead route different question types to different retrieval strategies:
```python
# What to do instead: route by question type
# (classify_intent and the chunk_by_* helpers are placeholders; later articles cover them)
def chunk_for_question(question: str, line_df, toc_df):
    intent = classify_intent(question)
    if intent == "point_lookup":         # "what is the effective date?"
        return chunk_by_line(line_df)
    elif intent == "section_retrieval":  # "what are the exclusions?"
        return chunk_by_toc_section(line_df, toc_df)
    elif intent == "comparison":         # "compare clauses A and B"
        return chunk_by_full_section(line_df, toc_df)
    # Fail loudly instead of silently returning None for an unhandled intent
    raise ValueError(f"unhandled intent: {intent}")
```

The two code blocks above capture the full argument of this section. The first runs Optuna over four parameters for two weeks and produces an indefensible number. The second makes one structural decision per question type and yields a system anyone can explain.
Later articles cover how to classify intent (Article 6, on question understanding) and how the various retrieval methods and granularities are built (Article 7, on retrieval). The point here is simply that the work isn’t tuning—it’s routing.
2.2 The evaluation dataset argument
Another ML-minded import is evaluation methodology. The thinking is: RAG, like any ML system, requires a proper evaluation dataset—questions paired with expected answers, split into training and test sets, scored with precision and recall. Frameworks like RAGAS make this even more appealing, providing metrics for faithfulness, answer relevancy, and context recall that feel convincingly ML-like.
Evaluation is valuable. The question isn’t whether to evaluate. It’s what the metrics actually indicate. In machine learning, evaluation reveals whether a model has generalized from training data to unseen cases. The train/test split exists to catch overfitting—a model that memorized the training set rather than learning a transferable pattern.
In RAG, there’s nothing to generalize. Overfitting can’t occur: the system doesn’t change between queries. The retriever computes the same cosine distances every time. The generator follows the same prompt template. No model adapts to data.
Evaluation in RAG measures three things, all of which concern coverage and quality rather than statistical generalization:
- Does my corpus contain the answer? If not, the system can’t surface it. This is a content question, not a model question.
- Does my retriever locate the correct passage? If the answer exists in the corpus but the retriever missed it, the system breaks down. This is a search question.
- Does my generator stay faithful to what was retrieved? If the right passage was found but the model paraphrased it poorly or hallucinated extras, the system breaks down. This is a generation discipline question.
Each one calls for a specific fix. Combining them under a single “accuracy” score erases useful information. A 75% accuracy stemming from “the corpus lacks answers for 25% of documented topics” requires different action than a 75% accuracy from “the retriever misses the correct passage 25% of the time.” The first demands ingesting more documents. The second demands improving the retriever. A combined metric that treats them identically obscures the diagnosis.
This also explains why teams using RAGAS-style frameworks sometimes report strong metrics on a held-out test set only to see the system falter in production. The test set covered topics where the corpus had answers and the retriever happened to locate them. Production encounters questions whose answers aren’t in the corpus at all, and the system either hallucinates or fails to say “not found.” The metric was high on the test set because the test set was easy. The system isn’t broken—the evaluation was.
What you need to evaluate, broken down by question type, takes about ten lines:
```python
import pandas as pd


# For each reference question, measure retrieval recall and record its intent type
def evaluate_retrieval(reference_set, retrieve_fn):
    rows = []
    for ref in reference_set:
        retrieved_lines = retrieve_fn(ref.question)
        # Assumed non-empty: the reference set only holds questions whose answer is in the corpus
        expected = set(ref.expected_lines)
        # Set intersection (&), not the boolean `and`, counts the expected lines actually retrieved
        recall = len(set(retrieved_lines) & expected) / len(expected)
        rows.append({
            "question": ref.question,
            "intent": ref.intent,
            "recall": recall,
            "hit": recall > 0,
        })
    return pd.DataFrame(rows)


# Always break down by question type, never just a summary
df = evaluate_retrieval(reference_set, retrieve_fn)
df.groupby("intent")["hit"].mean()
# point_lookup         0.92
# section_retrieval    0.41   <-- this is the real issue
# comparison           0.55
```

A single overall accuracy score of 63% would have completely masked the disaster on section_retrieval. Separating the results by intent makes the problem immediately obvious. In this context, recall measures: on questions where the correct answer exists in the corpus, was the relevant passage actually retrieved? Grouping by intent (point_lookup, section_retrieval, etc.) reveals which category of question is failing and, by extension, which component of the pipeline needs fixing.
RAG has two very different evaluation surfaces to consider.
The retrieval side is fundamentally a search challenge: did the correct passage end up in front of the model? To measure this, you check a reference set of questions and verify whether the relevant lines or pages were fetched at all. The key metric is recall at whatever level matters to you — line, page, or section — and it’s unique to your own corpus. Nobody else can run this evaluation for you because your data is your data. This is where most of your evaluation effort should be focused.
The generation side is a different beast entirely. Once the right passage has been retrieved, the question becomes whether the model gives a well-grounded answer in the correct format, with proper citations, and responds with a clear “not found” when the passage doesn’t contain the answer. Some of this you’ll evaluate yourself, but most of it has already been tested extensively by LLM vendors. Companies like OpenAI, Anthropic, and Mistral invest heavily in ensuring their models follow JSON schemas, resist inventing facts, and comply with prompt instructions. These are the specific areas where they continuously improve. As a RAG builder, you’re not training the generator — you’re relying on it. If the model struggles badly with structured JSON output or deviates from its inputs, you’ll notice within an hour or two of integration. Those aren’t metrics you need to define — they’re sanity checks that are immediately apparent.
The bottom line: the majority of your evaluation time should go toward retrieval testing, which is corpus-specific and only you can do it, rather than generation testing, which is largely the vendor’s responsibility and where glaring problems surface quickly. Teams that invest weeks building elaborate generation evaluation suites are usually just avoiding the harder retrieval work that would actually move the needle.
Learn more: Evaluating Your System (covered later in this series) details how to build a reference set tailored to your specific corpus, the four key metrics that actually matter, and why per-question-type metrics are essential while aggregated metrics are misleading.
2.3 The explainability question
Machine learning has a well-established set of explainability tools: SHAP values to link predictions to specific features, LIME to build local approximations of complex models, and attention visualizations for transformers. When people begin asking about RAG explainability (“why did the system give this answer?”), they instinctively gravitate toward these methods. They want to measure retrieval relevance, weight the contribution of individual documents, and visualize which tokens shaped the output.
The irony is that RAG is inherently more transparent than most ML approaches. There’s no need for SHAP. There’s no black box to crack open. The system pulled these specific passages from these specific sources, and the answer was derived from them. That is the explanation. It’s factual, not inferential.
This reveals a deeper asymmetry between traditional machine learning and RAG. In ML, humans have gut instincts but struggle to pin down exact numbers. Ask people what decided who survived the Titanic and they’ll say wealth, emotions, social class: all partially correct, but none exact. The model has no such hesitation: fit a decision tree and the root split is sex, the next division is a precise age threshold no one would have guessed, followed by class. Each split produces a number that intuition alone could never yield. The model’s job is to surface those concrete figures.
With text data, the dynamic flips. The user can read the source directly. A lawyer examining a contract sees the terms, the exceptions, the timelines. A compliance officer reading a policy document already knows what constitutes a violation. The text doesn’t obscure its meaning, and the expert is already a competent reader.
There are exceptions — sarcasm and irony are the textbook examples, where modern LLMs can sometimes detect what a literal reading misses. But in enterprise settings, the user is typically the domain expert.
The model isn’t there to interpret the text. It’s there to do the reading across an entire corpus, and a simple citation is enough to let that expert confirm any answer in moments.
When a user asks “why this answer?”, the appropriate response isn’t a heatmap of attention weights or a feature attribution score. It’s: “I found the relevant details on pages 12, 47, and 89 of this contract. Here’s exactly what I read. The answer is directly drawn from those passages.” If the user disagrees with the answer, they can read the cited source themselves and decide. They don’t need an explainability framework — they need a citation.
The fifty-line pipeline described in Article 1 already demonstrated this. The prompt instructed the model to return start and end line numbers (mapped to their page locations) along with the answer, formatted as structured JSON. The annotator then highlighted those exact lines on the PDF. No SHAP, no LIME, no attention visualizations, no specialized observability tool. The “explanation” was a natural byproduct of how the prompt was designed. The citation is the answer’s proof, not an extra analysis layer bolted on afterward.
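A minimal sketch of how such a citation can be resolved back to the source text. The JSON field names (answer, start_line, end_line), the function name cited_lines, and the sample lines are assumptions made for illustration; they are not the format used in Article 1.

```python
import json


def cited_lines(document_lines, llm_response):
    """Return the answer and the source lines it cites (1-indexed, inclusive)."""
    payload = json.loads(llm_response)
    start, end = payload["start_line"], payload["end_line"]
    # Reject citations that point outside the document instead of trusting them
    if not (1 <= start <= end <= len(document_lines)):
        raise ValueError(f"citation {start}-{end} is outside the document")
    return payload["answer"], document_lines[start - 1:end]


# Made-up document and model response
document_lines = [
    "MASTER SERVICES AGREEMENT",
    "This agreement is effective as of March 1, 2025.",
    "It remains in force for three years.",
]
response = '{"answer": "March 1, 2025", "start_line": 2, "end_line": 2}'

answer, evidence = cited_lines(document_lines, response)
print(answer)
print(evidence)
```

The range check matters: a citation that points outside the document is itself a generation failure worth logging.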
The execution trace is the explanation. Reading it requires no interpretation — just reading.
Bringing ML explainability techniques into RAG is addressing a problem that doesn’t exist. Applying SHAP to a retrieval score is like using a scalpel to open a mailbox. The retrieval score is already a number you computed from inputs you can inspect directly. There’s nothing to attribute that you can’t already see.
The fundamental flaw in adopting the ML-explainability mindset is that it shifts your attention in the wrong direction. You end up trying to explain why one passage outscored another in vector space — an almost unanswerable question that ultimately doesn’t matter. What truly matters is whether the right passage was retrieved at all, and whether the answer faithfully reflects what was found.
Those are questions you can resolve by examining the logs and source code directly—no additional tooling required.
3. What shifts when you truly understand RAG
Once you move past the misconception that RAG is machine learning, two major shifts occur. First, your daily tools, success metrics, and team structure realign around search instead of model training. Second, a more fundamental question — where the intelligence resides — shifts from the algorithm to the human team. Both shifts stem from the same reframing.
3.1 Tools, metrics, and people
Three specific things transform.
Your tooling changes: You no longer need PyTorch, a GPU cluster, or hyperparameter tuning frameworks to build the core system. What you do need is a reliable parser, a capable retriever, deliberate prompt engineering, and thorough structured logging of every step. The pieces that are ML-based (the embedding model, the LLM) you use as third-party services — commodity building blocks, not things you develop from scratch.
Your metrics change: Overall accuracy scores get replaced by granular, failure-mode-focused metrics: retrieval recall (did the system find the correct passage?), answer faithfulness (did the model stick to what the passage actually says?), extraction accuracy (when pulling structured data, do the values match?), and not-found rate (when the answer truly isn’t in the corpus, did the system say so cleanly?). Each metric targets something precise, and each one maps to a specific pipeline component you can address directly.
Your team composition changes: A team made up entirely of ML specialists often overlooks what actually makes a RAG system succeed — or fail. The most critical skills are software engineering (the system has many interlocking parts that must work together cleanly), domain expertise (someone must understand what a correct answer to a domain-specific question even looks like), and information retrieval thinking (someone needs to reason like a search engine architect, not a model trainer). ML knowledge helps, but it’s not the primary skill. A team of ML researchers with no domain expert will build an elegantly optimized system that fundamentally misses the point. A team with one ML-aware engineer, two software engineers, and one domain expert will typically outperform them.
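To make the metrics shift concrete, here is a hedged sketch of how two of those metrics could be kept apart instead of being folded into one accuracy number. The record fields and the scorecard function are hypothetical, the labels are assumed to come from a human reviewer, and the hit rate shown is a coarse stand-in for retrieval recall.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    answerable: bool      # reviewer label: is the answer in the corpus?
    retrieved_hit: bool   # did retrieval return at least one expected passage?
    said_not_found: bool  # did the system reply "not found"?


def rate(flags):
    """Share of True values, or None when there is nothing to measure."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def scorecard(results):
    """Score answerable and unanswerable questions separately."""
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    return {
        "retrieval_hit_rate": rate(r.retrieved_hit for r in answerable),
        "not_found_rate": rate(r.said_not_found for r in unanswerable),
    }


# Made-up results: four answerable questions, two unanswerable ones
results = [
    QuestionResult(True, True, False),
    QuestionResult(True, True, False),
    QuestionResult(True, True, False),
    QuestionResult(True, False, False),
    QuestionResult(False, False, True),
    QuestionResult(False, False, False),
]
print(scorecard(results))
```

Each number points at a different component: a low hit rate sends you to the retriever or parser, while a low not-found rate sends you to the prompt and post-checks.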
3.2 Where the intelligence actually lives
The shift in team structure reflects a deeper reality: where does the system’s intelligence truly reside?
In a traditional ML system, the intelligence lives inside the model. The model captures the patterns. The team supplies training data and adjusts the loss function. In a RAG system, the intelligence lives with the team. The lawyer knows which contract clauses to examine first. The underwriter knows what “deductible” means and which page typically contains it. The compliance officer knows which regulation applies to which product. None of that knowledge exists inside the embedding model. None of it emerges from hyperparameter tuning. It already lives in the minds of people who have spent years reading these documents.
Watch an experienced underwriter open a fresh policy. She doesn’t read it from start to finish. She jumps directly to the exclusions section because she’s reviewed five hundred policies and knows that’s where the pitfalls hide. She checks the schedule of benefits for deductibles and coverage limits. She reviews the territory clause. Within three minutes, she has a sharper grasp of the contract than any embedding model would develop even after processing a thousand such documents. That learned instinct is what the system needs to amplify.
3.3 Amplifying expert judgment, brick by brick
The purpose of an enterprise RAG system is to scale that expertise, not to replicate it. How this manifests depends on the component.
Parsing is where it all begins. If the parser converts a contract’s PDF into garbled text, nothing downstream can fix it. If the document contains a functional table of contents, the parser must extract it cleanly — because that TOC is the navigation tool the expert depends on. When a document lacks a TOC entirely (scanned faxes, slides exported as PDF, older typed policies), reconstructing one becomes a task in itself, often more impactful than any retrieval improvement.
Question understanding uses the team’s vocabulary to bridge the gap between how a user asks a question and how the document phrases the answer. The pilot user searches for “kettle”, while the contract uses “small electrical appliance”. The compliance officer types “data breach”, whereas the policy says “unauthorized disclosure of personal information”. The expert understands this mapping. The question parser codifies it into a lookup table: translations across languages, spelling variants, plural forms, and internal abbreviations. None of this is learned from training data: it’s defined by the expert and documented.
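A lookup table of this kind can be sketched in a few lines. The dictionary below reuses the two examples from the paragraph above; the structure and the function name expand_question_terms are assumptions for illustration, not the implementation developed in Article 6.

```python
# Hypothetical expert vocabulary: user wording -> wording used in the documents
EXPERT_VOCABULARY = {
    "kettle": ["small electrical appliance"],
    "data breach": ["unauthorized disclosure of personal information"],
}


def expand_question_terms(question: str) -> list:
    """Return the user terms found in the question plus their document-side equivalents."""
    lowered = question.lower()
    terms = []
    for user_term, document_terms in EXPERT_VOCABULARY.items():
        # Plain substring match: good enough for a sketch, too loose for production
        if user_term in lowered:
            terms.append(user_term)
            terms.extend(document_terms)
    return terms


print(expand_question_terms("Is my kettle covered?"))
print(expand_question_terms("Who do I notify after a data breach?"))
```

Because the mapping is a plain dictionary, the expert can review and extend it without touching the retrieval code.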
Retrieval scales up what the expert already does manually. The expert can already search keywords — that part is straightforward. What the expert cannot do at scale is run regex patterns across thousands of pages, verify whether two terms co-occur within the same paragraph, or apply boolean logic across the entire corpus. The retriever handles those tasks quickly, then surfaces the results so the expert can validate them.
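As one illustration of the regex part, the sketch below scans every line of every page for a pattern and reports where each match was found. The page contents and the policy-number format (POL- followed by six digits) are invented for the example.

```python
import re


def find_pattern(pages, pattern):
    """Return (page_number, line_number, matched_text) for every regex match."""
    compiled = re.compile(pattern)
    hits = []
    for page_number, lines in pages.items():
        for line_number, line in enumerate(lines, start=1):
            for match in compiled.finditer(line):
                hits.append((page_number, line_number, match.group()))
    return hits


# Made-up pages: page number -> list of lines
pages = {
    1: ["Policy POL-123456 issued to ACME."],
    2: ["See endorsement END-42.", "Supersedes POL-000789."],
}
print(find_pattern(pages, r"\bPOL-\d{6}\b"))
```

Returning page and line numbers alongside each match keeps the result verifiable by the expert, which is the point of the retriever surfacing its work.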
Generation handles the two tasks the expert would otherwise do manually: citing the exact passage that justifies the answer, and converting the raw value into a usable format. The string 3455434 in the document becomes €3,455,434 in the response. 20260516 becomes May 16, 2026. A phrase like thirty days from the date of the loss stays exactly as written, with a citation linking back to the clause so the expert can verify it in one click.
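The two conversions mentioned above are deterministic and can be written directly. This is a sketch only: the function names are hypothetical, and it assumes the value's type (euro amount or compact YYYYMMDD date) is already known when the formatter is called.

```python
from datetime import datetime


def format_amount_eur(raw: str) -> str:
    """Turn a raw integer string into a euro amount with thousands separators."""
    return f"€{int(raw):,}"


def format_compact_date(raw: str) -> str:
    """Turn a YYYYMMDD string into a date such as 'May 16, 2026'."""
    parsed = datetime.strptime(raw, "%Y%m%d")
    # Build the day manually to avoid a zero-padded day such as 'May 06'
    return f"{parsed:%B} {parsed.day}, {parsed.year}"


print(format_amount_eur("3455434"))     # €3,455,434
print(format_compact_date("20260516"))  # May 16, 2026
```

Doing this conversion in code rather than in the prompt keeps the formatting testable, and a malformed value raises an error instead of producing a plausible-looking answer.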
Articles 5, 6, 7, and 8 each develop one of these components in detail: the parser that reconstructs TOC structure, the expert vocabulary dictionary, the TOC-aware retriever, and the typed-answer generator. The principle is identical every time: identify a specific piece of human expertise and offload the repetitive parts to the machine.
This is also why the series exercises caution around autonomous agents, defaults to keyword retrieval over embedding similarity, and treats reranker tuning as a last-resort option. Each of the alternatives (autonomous agents, embedding-first retrieval, heavy reranker tuning) assumes there’s no expert available. In enterprise settings, the expert is always present, and the system should defer to them.
If your situation involves no available experts, highly open-ended questions, or extremely varied document types, this series won’t be your best resource. General-purpose retrieval and autonomous agents are a more appropriate fit for those scenarios.
4. Two parts, two failure modes
A helpful mental model for RAG is: a search engine, paired with an LLM that composes the answer. Two components, each with a distinct responsibility, and each with its own characteristic way of failing.
The search engine pulls relevant passages from documents. Given a question, it returns the lines, paragraphs, or sections most likely to hold the answer. This is a pure search challenge: selectivity, recall, ranking. Decades of information retrieval research are directly applicable. The fact that part of the system uses neural embeddings doesn’t alter its fundamental nature — embedding similarity is simply one ranking signal among many.
The LLM takes a retrieved passage and the original question and produces a natural-language answer with a citation. The LLM doesn’t locate the answer — the search engine already handled that. The LLM expresses the answer using a passage it was given.
It functions more like a translator or scribe than a fortune teller.
Referring back to the four building blocks introduced in Article 1: parsing, understanding the question, and fetching information form the search engine; generating the answer is the job of the LLM. The “brick view” is how the system actually runs (one piece of code per block); the “two-part view” is the mental framework you use when troubleshooting issues.
These two components fail in distinct ways, and troubleshooting begins at the point where they connect. Look at the log for a failed query: did the model actually see the retrieved text, and did that text include the correct answer?
If the correct answer was missing from the retrieved text, the search engine is at fault, and the solution lies earlier in the process. Was the correct page ruined by the parser (OCR mistakes, multi-word terms broken across lines, two-column text mixed up)? Did the question analyzer overlook a synonym that the expert vocabulary should have included? Did the search tool rank the correct page outside the top_k results, or did it stumble over punctuation that required a regex? Or is the document simply missing from the database? These are four very different problems with four different fixes, all upstream. The phrase “tune the retriever” is useless until you pinpoint the exact issue. The building blocks that boost expert performance when they work (section 3.3) each have their own failure modes, and the three upstream ones are each covered in a dedicated article (Articles 5, 6, 7).
If the correct answer was present in the retrieved text but the output is still wrong, the LLM is at fault, and the fix lies further down the line. Typical issues include: the model reworded the answer and dropped a specific condition, it returned the raw code 3455434 because the format allowed a free-form answer, it referenced the wrong line numbers, it made up a value not found in the text, or it gave an answer when it should have responded “not found”. These are five generation errors, each requiring a different solution, all within the prompt, format, or post-check layer (Article 8). None of these improve by adjusting the retriever.
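The first triage step can be sketched as a small function over one logged query. This is a simplification under stated assumptions: a case-insensitive substring check stands in for "the passage contains the answer", and the function name, its arguments, and the sample strings are all hypothetical.

```python
def triage(expected_answer, retrieved_passages, system_answer):
    """Classify one logged query as 'ok', 'search', or 'generation'."""
    expected = expected_answer.lower()
    if expected in system_answer.lower():
        return "ok"
    answer_was_retrieved = any(expected in passage.lower() for passage in retrieved_passages)
    # Answer retrieved but output wrong -> generation; answer never retrieved -> search
    return "generation" if answer_was_retrieved else "search"


# Made-up log entries
good_passages = ["Claims must be filed within thirty days from the date of the loss."]
bad_passages = ["The policy covers water damage."]
wrong_answer = "Claims must be filed within sixty days."

print(triage("thirty days", good_passages, wrong_answer))
print(triage("thirty days", bad_passages, wrong_answer))
```

The label only says which half of the system to open. Finding the specific cause within that half still means reading the log.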
Here is how this troubleshooting works in real life. A user asks, “how many attention heads does the base Transformer model have?” The answer is 8, found on page 5 of the Attention Is All You Need paper (Vaswani et al., 2017; arXiv non-exclusive distribution license, noted on the arXiv abstract page). The system replies “16”. Check the log.
The search returned pages 4, 7, and 8. None of these contain the base model’s setup: page 8 discusses the large model (which indeed uses 16 heads), while pages 4 and 7 cover the encoder’s design. The generator read the incorrect pages and reported the number it found there. The error is in retrieval, not generation.
Why did the search skip page 5? The search terms were ['heads', 'base', 'model']. Page 7 mentions heads six times; page 5 mentions it only twice. The keyword-based retriever ranked page 7 higher because it scored based on how often a word appeared, without verifying if base, model, and heads appeared together on the same line. A five-line change in the Python code for the keyword retriever resolves this.
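The difference between the two scoring rules can be shown on toy data. The sketch below is not the actual fix from this series, and the page lines are invented (they are not quotations from the paper). It only contrasts counting term occurrences with counting lines on which all query terms appear together.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequency_score(page_lines, terms):
    """Count every occurrence of every query term on the page."""
    tokens = [token for line in page_lines for token in tokenize(line)]
    return sum(tokens.count(term) for term in terms)


def cooccurrence_score(page_lines, terms):
    """Count the lines that contain all query terms at once."""
    wanted = set(terms)
    return sum(1 for line in page_lines if wanted <= set(tokenize(line)))


def best_page(pages, terms, score):
    """Return the name of the page with the highest score."""
    return max(pages, key=lambda name: score(pages[name], terms))


# Invented pages: page_a answers the question, page_b merely repeats one keyword
pages = {
    "page_a": ["the base model uses 8 heads"],
    "page_b": [
        "attention heads attend to different positions",
        "some heads track syntax",
        "other heads track long range links",
        "pruning heads rarely hurts the model",
    ],
}
terms = ["heads", "base", "model"]

print(best_page(pages, terms, frequency_score))
print(best_page(pages, terms, cooccurrence_score))
```

Frequency scoring prefers page_b because it repeats "heads". Co-occurrence scoring prefers page_a, the only page where all three terms share a line.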
What did not happen: no one fine-tuned any model. No one ran a parameter sweep. No one added a reranking tool. The diagnosis took five minutes; the fix took an afternoon.
This clear division is what makes RAG practical to use. Every failure has a specific component to address. There is no training process where retrieval and generation become intertwined. They are separate parts, combined neatly, each one swappable on its own. Real-world systems benefit greatly from this: you can switch embedding models, switch LLMs, or switch parsers without needing to retrain anything.
The entire pipeline is configuration, not a model.
When an issue arises, you adjust a setting: the search method, the prompt, the format, or a validation rule. You do not retrain. You modify a Python file, deploy it, check the performance metric for the affected question type, and verify the fix. The update cycle takes hours, not weeks.
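Checking the metric for the affected question type can be as simple as grouping evaluation results before averaging them. The sketch below assumes each result is a dictionary with hypothetical `question_type` and `correct` keys; the type names and the numbers are invented.

```python
from collections import defaultdict


def accuracy_by_type(results: list[dict]) -> dict[str, float]:
    """Share of correct answers for each question type."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["question_type"]] += 1
        correct[row["question_type"]] += int(row["correct"])
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Invented results before and after a fix to the retriever.
before = [
    {"question_type": "exact_number", "correct": False},
    {"question_type": "exact_number", "correct": True},
    {"question_type": "clause_lookup", "correct": True},
]
after = [
    {"question_type": "exact_number", "correct": True},
    {"question_type": "exact_number", "correct": True},
    {"question_type": "clause_lookup", "correct": True},
]
print(accuracy_by_type(before))
print(accuracy_by_type(after))
```

Comparing the two dictionaries shows whether the fix moved the question type it targeted and whether any other type regressed, which a single overall accuracy figure would hide.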
Once you view RAG as a system to configure rather than a behavior to train, the rest of the series’ decisions become obvious.
5. Six months spent on the wrong issue
A team at a medium-sized company is tasked with building a RAG system for a few thousand internal documents over six months. They begin by creating a test set of 500 questions, dividing it 70/30 into training and testing groups. They configure Optuna to test different chunk sizes, overlaps, top-k values, and similarity thresholds. The first round of testing takes a week of computing power, returns a “best” setup, and the team releases it for internal review.
The test users complain right away. The system responds smoothly but is incorrect 50% of the time on questions the reviewers clearly know the answers to: questions about specific clauses, exact dates, or specific numerical limits. The team’s reaction is to grow the test set, run another round of tests, fine-tune the embedding model using artificial question-document pairs, and add a reranker. Another three months pass. The accuracy in production stays the same.
The root cause: the parser was handling scanned pages with poor OCR layers as if they were clean digital text. Roughly 30% of the documents were essentially unreadable, but the team’s test questions were taken from the readable 70%. No amount of chunk size tweaking, embedding fine-tuning, or reranker addition could solve this: nearly a third of the documents were producing nonsense. A two-day effort to inspect each page (the focus of Article 5, on parsing) would have identified this on the very first day.
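A page inspection of this kind can start with a rough automated pass that points a human reviewer at suspicious pages. The heuristic below is an assumption made for illustration, not the method from Article 5: it counts how many tokens look like ordinary words or numbers, and the 0.7 threshold is arbitrary. The sample page text is invented.

```python
PUNCTUATION = ".,;:!?()[]'\"%"


def looks_like_word(token: str) -> bool:
    """True for plain numbers and for alphabetic tokens containing a vowel."""
    core = token.strip(PUNCTUATION)
    if not core:
        return False
    if core.replace(".", "").replace(",", "").isdigit():
        return True
    return core.isalpha() and any(c in "aeiouy" for c in core.lower())


def readable_ratio(page_text: str) -> float:
    """Fraction of whitespace-separated tokens that look like words."""
    tokens = page_text.split()
    if not tokens:
        return 0.0
    return sum(looks_like_word(t) for t in tokens) / len(tokens)


def flag_pages(pages: dict[int, str], threshold: float = 0.7) -> list[int]:
    """Page numbers whose readable ratio falls below the threshold."""
    return sorted(n for n, text in pages.items() if readable_ratio(text) < threshold)


# Invented samples: one clean page, one with typical OCR substitutions.
clean = "The supplier shall deliver within 30 days of the order date."
garbled = "Th3 5upp|ier sha11 de1iver w!thin 3O d@ys 0f t#e 0rder"
print(flag_pages({1: clean, 2: garbled}))  # [2]
```

A check like this only narrows the list. Pages with tables, codes, or formulas will score low for legitimate reasons, so flagged pages still need to be opened and read.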
The team had spent six months in ML mode (testing hyperparameters, expanding test sets, fine-tuning models) when the solution was simply changing the parser.

This account is a composite, but every detail has occurred in actual projects. The trend is always the same: ML habits push the team toward optimization tasks that seem productive, while the core problems remain unsolved in the parser, the document set, or the “not found” logic. The first response to a failing RAG system should not be “let’s tune it.” It should be “let’s track what happens to a failing query, from start to finish, and identify the broken step.”
6. Conclusion
RAG appears to be machine learning. The similarity is only surface-level. The answer is either in the document or it is not. There is no statistical generalization, no learning curve, and no train/test split that reflects real-world failures. The correct approach is assembly, not training: a search engine combined with an LLM, two parts you can repair separately, using per-failure-mode metrics instead of overall accuracy.
The price of sticking to the ML mindset is not theoretical. It is six months of hard work directed at the wrong problem. Article 4 translates the correct approach into a practical diagnostic tool: RAG issues fall into a grid of document complexity versus question control, and each box requires a different technical stack.
Article 4 serves as an introduction to Enterprise Document Intelligence Volume 1. This series guides you through building a Retrieval-Augmented Generation (RAG) system step by step, covering essential components like document parsing, question analysis, information retrieval, and answer generation. The methodology emphasizes engineering solutions over complex machine learning models.

7. Recommended Additional Reading
The article situates RAG within the 50-year history of Information Retrieval (IR), rather than the machine learning field. This perspective is backed by the empirical findings of Thakur et al., who demonstrated that the classic BM25 algorithm frequently performs better than modern dense retrieval models when dealing with unfamiliar data. The method of analyzing systems based on specific failure modes, as used in this article, aligns with the research by Barnett et al. on engineering RAG systems. The article also includes a candid note: while the core system is built on engineering principles, the reranker component is a minor machine learning layer. For explainability, the system uses a straightforward concept: citations serve as explanations. By providing source lines for every answer, users can verify the information directly, eliminating the need for the complex explainability tools often required in ML projects.
This article builds upon similar ideas:
- Manning, Raghavan, Schütze, Introduction to Information Retrieval (Cambridge, 2008). The foundational text for the 50-year IR tradition that the article references.
- Thakur et al., the BEIR benchmark, NeurIPS 2021 (arXiv:2104.08663). This study provides key evidence that dense retrieval models trained on MS MARCO often underperform compared to BM25 on unfamiliar datasets. These findings strongly support the article’s argument for using IR techniques instead of ML.
- Barnett et al., Seven Failure Points When Engineering a Retrieval Augmented Generation System, 2024 (arXiv:2401.05856). This work provides a practical taxonomy of common issues in RAG systems, which is the same focus as the article’s per-failure-mode approach.
- Kamradt, Needle in a Haystack (2023). A well-known benchmark for testing how well a model can find a hidden piece of information in a long text. This is primarily for research purposes, as it uses a simple fact-retrieval test rather than the complex queries common in business environments. It is discussed further in Series Articles 1 and 7.
These resources offer a different perspective or focus:
- Es et al., RAGAS: Automated Evaluation of Retrieval Augmented Generation, EACL 2024 (arXiv:2309.15217). This paper evaluates RAG using machine learning metrics like faithfulness and answer relevance on public datasets. This contrasts with the article’s focus on measuring failure rates within a specific company’s documents.
- Saad-Falcon et al., ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems, NAACL 2024 (arXiv:2311.09476). A framework for ML-based RAG evaluation that relies on creating artificial training and test sets. The article argues against this split-test approach, explaining that for business RAG, the information you need either exists in the company’s records or it doesn’t.
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 (arXiv:2005.11401). The original paper that coined the term ‘RAG’ and described a system where the retriever and generator are trained together. While it is a crucial reference, it is important to note that this paper approaches RAG from a machine learning perspective, which differs from the engineering-first method highlighted in this article.



