Inside the RAG Evaluation Trap: When Your Metrics Lie to You

Water cooler small talk is a unique form of casual conversation, commonly seen in workplace settings near the water cooler. In these moments, workers tend to exchange all sorts of company rumors, urban myths, folk tales, questionable scientific views, overly personal stories, or even complete fabrications. Absolutely anything gets shared. In my Water Cooler Small Talk posts, I explore bizarre and usually scientifically flawed viewpoints that I, my friends, or someone I know have overheard at work: opinions that genuinely left us at a loss for words.

So, here is the water cooler take from today’s post:

We’ve developed a RAG application that’s performing really well. We’re currently in the evaluation phase, and things are going smoothly because throughout all our testing, we keep spotting problems and resolving them. We’ve already reached a 97% score.

Now, I’d like you to stop for a moment and consider what might be flawed about this claim. 🤔 Because at first glance, it sounds completely logical. Discovering problems and addressing them sounds like precisely what a solid evaluation process should involve, right? Even admirable, perhaps. So what’s actually going on here?

The issue is subtle yet critical. If your evaluation process is being used to spot problems, which you then fix, and then you reassess using the identical test cases, you’re no longer truly evaluating. The evaluation set has one essential quality that makes it valuable: the model hasn’t been exposed to it previously. Every time you refine the system based on its outcomes and then retest on the same data, you erode a bit more of that quality. Put simply, the evaluation set has silently merged into the development workflow and now functions more like a training set.

However, executing this correctly is far easier in theory than in practice. In reality, conducting evaluations properly can be genuinely draining. Especially when it comes to assessing RAG applications—where the evaluation set consists of question-and-answer pairs rather than a pre-existing dataset—doing things the right way can be extremely demanding and slow. Still, neglecting proper evaluation practices leads to a well-known machine learning pitfall: overfitting.

What exactly is overfitting?

Let’s take a brief step back and revisit some fundamentals of machine learning.

In machine learning, a model is constructed using data that is usually divided into a training set, a validation set, and a test set. To be more precise, the model is first fitted on the training set, which is the data that guides us in choosing the appropriate model type and tuning its parameters accordingly. At its most basic, the training set contains pairs of x and y data, and our objective is to derive a y = f(x) model that best fits the available x and y data.

After that’s complete, the trained model is used to generate predictions on the validation set. Specifically, for every x in the validation set, we produce a predicted y = f(x) using the chosen model, then measure how closely it aligns with the actual y from the validation set, and refine our model based on that comparison.

At the very end, once we’ve settled on which model we want to move forward with based on the validation step, we also run it against the test set. The purpose of the test set is to gauge how effectively the final model generalizes to completely new data by computing its scores, and this is precisely why the test set should only be used a single time.
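To make the three roles concrete, here is a minimal sketch of a three-way split in plain Python. The function name `three_way_split`, the 70/15/15 proportions, and the toy data are my own assumptions for illustration, not a prescribed recipe.

```python
import random

def three_way_split(pairs, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle (x, y) pairs once, then cut them into train / validation / test."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                    # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]       # used to compare and refine models
    train = shuffled[n_test + n_val:]           # used to fit the model
    return train, val, test

# Toy data following y = 2x + 1 (purely illustrative).
data = [(x, 2 * x + 1) for x in range(100)]
train, val, test = three_way_split(data)
```

The important part is not the arithmetic but the discipline: the split happens once, up front, and the test portion is set aside until the final check.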

We follow all these steps because our aim isn’t merely to fit the training set, but to capture what the training set represents. This way, we can build models that internalize the underlying patterns well enough to make reliable predictions on fresh, unseen data, which is what the test set stands in for.

Regrettably, we don’t always succeed in this, and instead of building models that capture the broader picture, we end up with models that merely conform to a narrow training set without generalizing. This is what’s known as overfitting. Consequently, the model delivers outstanding results on the training set, posting impressive numbers, but struggles when faced with anything unfamiliar.
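A toy illustration of this, under assumptions I made up for the example: the true pattern is y = 2x plus Gaussian noise, one "model" simply memorizes the training points (a 1-nearest-neighbor lookup), and the other is an ordinary least-squares line. The memorizer is perfect on the training set by construction, which is exactly why its training score tells us nothing.

```python
import random

rng = random.Random(42)

def sample(n):
    """Noisy samples from the assumed true pattern y = 2x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

train, test = sample(30), sample(500)

def memorizer(x):
    """Overfit model: return the y of the closest training x, noise included."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Least-squares line fitted on the training set only.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line(x):
    """Simple model: captures the trend, ignores the noise."""
    return slope * x + intercept

def mse(model, pairs):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)

print("memorizer train/test MSE:", mse(memorizer, train), mse(memorizer, test))
print("line      train/test MSE:", mse(line, train), mse(line, test))
```

The memorizer's training error is zero because it has stored the noise along with the signal; on unseen points that stored noise turns into extra error, while the plain line holds up better.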

The key insight here is that the test set only holds meaning if the model has truly never encountered it before. The instant you leverage it to make any decision about the model—even one that seems insignificant—you’ve tainted it and effectively folded it into the training set.

But after this brief detour into machine learning fundamentals, let’s circle back to our original water cooler claim.

Overfitting in RAG evaluation

This is where things become especially pertinent for those of us developing and assessing AI applications.

In my series on evaluating RAG pipelines, we spent considerable time discussing retrieval metrics: Precision@k, Recall@k, MRR, NDCG@k, and others. Yet all those sophisticated metrics are only as meaningful as the evaluation set you apply them to. As it happens, the boundary between validation and test sets in RAG can become blurred surprisingly easily. I’d partly attribute this to the fact that, unlike a straightforward regression model, AI systems and RAG pipelines are far from intuitive to us. We have little genuine understanding of how the system is actually fitting the data, and as a result, we may go overboard and tune it on the test set without even realizing we’ve done so.
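As a quick refresher, here is a minimal sketch of three of those metrics using their standard definitions. The function names and the document IDs are hypothetical, chosen only for this example. Note that nothing in these formulas knows whether the questions are fresh or have been reused twenty times, which is the whole point of this post.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries with made-up document IDs.
q1 = (["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
q2 = (["d5", "d6"], {"d5"})

print(precision_at_k(*q1, k=3))        # 1 relevant doc in the top 3 -> 1/3
print(recall_at_k(*q1, k=3))           # 1 of 3 relevant docs found  -> 1/3
print(mean_reciprocal_rank([q1, q2]))  # (1/2 + 1/1) / 2 = 0.75
```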

The team in our water cooler anecdote is doing precisely this. They spot issues during evaluation, address them, and reassess using the same question-answer pairs. Naturally, with each cycle, the evaluation scores climb because they’re essentially training the AI application on the test set.
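A small simulation shows how far this effect can go even when nothing real improves. It is a deliberately crude toy with assumptions of my own: every "tweak" leaves the true accuracy at 70%, each tweak's score on the reused 50-question set is just a fresh random draw, and the team keeps whichever tweak scores highest. Real tweaks are correlated with each other, so treat this as an intuition pump rather than a model of your pipeline.

```python
import random

rng = random.Random(7)
TRUE_ACCURACY = 0.70   # assumption: no tweak changes the real quality
DEV_SIZE = 50          # the small evaluation set that keeps being reused
FRESH_SIZE = 2000      # questions the system has never been tuned against

def score(n_questions):
    """Fraction answered correctly when each answer is right with TRUE_ACCURACY."""
    return sum(rng.random() < TRUE_ACCURACY for _ in range(n_questions)) / n_questions

initial_dev = score(DEV_SIZE)
best_dev = initial_dev
for _ in range(200):               # 200 rounds of "find an issue, fix, re-run"
    candidate_dev = score(DEV_SIZE)
    if candidate_dev > best_dev:   # keep the tweak only if the score went up
        best_dev = candidate_dev

fresh = score(FRESH_SIZE)          # what the kept version does on new questions

print(f"first run on reused set:  {initial_dev:.2f}")
print(f"after 200 tweaks:         {best_dev:.2f}")
print(f"same system, fresh set:   {fresh:.2f}")
```

The reported score climbs purely through selection on noise, while performance on fresh questions stays around the true 70%. The smaller the evaluation set and the more rounds of tuning, the wider that gap gets.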

Specifically, here are the most frequent ways this manifests in RAG:

Adjusting prompts based on the evaluation set: This is likely the most prevalent pattern, and it’s exactly what occurred in our water cooler scenario. You conduct an evaluation, observe that particular question categories consistently fail, and tweak your system prompt or retrieval logic to resolve them. Then you reassess on the identical set. Naturally, the scores get better; you might even achieve a flawless 100% score.
Selecting questions the system already handles well: A subtler variation of the same issue. When assembling an evaluation set, there’s a temptation to incorporate examples you already know the system handles effectively, particularly ones you’ve casually tested during development. Over time, the evaluation set gravitates toward the system’s strong points and moves further away from its weaknesses. The metrics appear excellent, but in truth, nobody knows what the real performance level is.
Crafting your test questions from the same documents you’ve indexed: The evaluation questions in your test set were likely crafted by closely examining the documents already stored in your knowledge base, meaning they are subtly influenced by what you already know can be retrieved. Put another way, the questions were never truly independent of the underlying data, though this is especially difficult to notice since we express questions and answers in natural language rather than in simple numerical form like x and y.

The straightforward yet challenging remedy for all of these situations mirrors the classic machine learning approach: maintain a genuinely held-out test set that you consult as infrequently as possible, design your questions without reference to the system’s known behavior, and view suspiciously strong metrics with a healthy dose of doubt. A RAG system that excels on a small, meticulously curated, and repeatedly used evaluation set is much like a student who memorized past exam papers but is entirely unprepared for the first real question that doesn’t closely resemble the ones they’ve already practiced.
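One way to keep a held-out portion honest is to decide membership by hashing a stable question ID, so that no one can quietly move a stubborn question from one side to the other. The sketch below assumes each question-answer pair is a dict with an `id` field; that schema, the function names, and the 20% default are hypothetical choices for illustration.

```python
import hashlib

def is_held_out(question_id, held_out_percent=20):
    """Deterministically assign a question to the held-out set from its ID."""
    digest = hashlib.sha256(question_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < held_out_percent

def split_eval_set(qa_pairs, held_out_percent=20):
    """Split QA pairs into a dev set (iterate freely) and a held-out set (use rarely)."""
    dev, held_out = [], []
    for pair in qa_pairs:
        target = held_out if is_held_out(pair["id"], held_out_percent) else dev
        target.append(pair)
    return dev, held_out

# Hypothetical evaluation set.
qa_pairs = [
    {"id": f"q{i}", "question": f"Question {i}?", "answer": f"Answer {i}"}
    for i in range(200)
]
dev, held_out = split_eval_set(qa_pairs)
print(len(dev), "questions for day-to-day tuning,", len(held_out), "held out")
```

Because the assignment depends only on the ID, new questions added later fall on one side or the other without reshuffling the existing ones, and the held-out share is only approximately 20%.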

If you want to sanity-check your own RAG evaluation setup, here’s a brief list of questions worth reflecting on and answering honestly:

When I created my evaluation set, did I draft the questions independently of the documents in my knowledge base, or did I examine the documents first and craft questions I already knew could be answered?
Have I ever simply removed or swapped out a question from my evaluation set because the application kept getting it wrong?
Do I have a general sense of how my system handles questions it has never encountered before, or am I only familiar with its performance on the same fixed set I keep reusing?
Is there a portion of my evaluation set that has remained untouched and unseen by me for an extended period?

If you answered no to that last one, you might already be the team from today’s cautionary tale. 😉
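If honest answers are hard to come by from memory, a little bookkeeping can help. The following is a minimal sketch of my own, not an established tool: it fingerprints the evaluation set on every run, so you can see how often the exact same set has been reused and whether questions were silently removed or swapped. The `EvalLog` class and the dict schema are hypothetical.

```python
import hashlib
import json

def fingerprint(qa_pairs):
    """Order-independent hash of the evaluation set's content."""
    ordered = sorted(qa_pairs, key=lambda p: p["question"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class EvalLog:
    """Records which evaluation set each run used and what it scored."""

    def __init__(self):
        self.runs = []  # list of (fingerprint, score) tuples

    def record(self, qa_pairs, score):
        self.runs.append((fingerprint(qa_pairs), score))

    def reuse_count(self, qa_pairs):
        """How many recorded runs used exactly this set of questions."""
        fp = fingerprint(qa_pairs)
        return sum(1 for run_fp, _ in self.runs if run_fp == fp)

    def set_was_edited(self):
        """True if recorded runs used more than one version of the set."""
        return len({run_fp for run_fp, _ in self.runs}) > 1

# Hypothetical usage: two runs on the same set, then one with a question dropped.
eval_set = [
    {"question": "What is the refund policy?", "answer": "30 days"},
    {"question": "Who approves travel expenses?", "answer": "The line manager"},
]
log = EvalLog()
log.record(eval_set, 0.91)
log.record(eval_set, 0.94)
reused = log.reuse_count(eval_set)
log.record(eval_set[:1], 0.97)
print("runs on the identical set:", reused)
print("set changed between runs:", log.set_was_edited())
```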

Overfitting in Real Life: Goodhart’s Law

Goodhart’s Law, formulated by economist Charles Goodhart in 1975, is often expressed as a proverb along these lines:

When a measure becomes a target, it ceases to be a good measure.

This concept originally emerged from monetary policy, but it extends remarkably well beyond economics, appearing almost anywhere a number is used to assess performance, such as KPIs, budgets, and all sorts of metrics. Consider a car salesman rewarded for the number of cars sold each month who begins selling cars even at a financial loss; hospitals that try to shorten patient stays and end up discharging patients prematurely; citation counts on academic papers being manipulated; and so on.

All of these examples operate through the exact same underlying mechanism: a quantitative metric is introduced to monitor something important. For a time, the metric and the real thing move in tandem, and it feels safe to rely on the metric’s trajectory as a proxy for the real thing’s trajectory. Then people (or systems) begin optimizing directly for the metric rather than the underlying important objective, and the two quietly diverge. The metric starts improving without the underlying thing it was supposed to represent improving alongside it.

In the context of AI specifically, this failure mode is known as reward hacking, which happens when an AI system optimizes a poorly defined reward signal without actually achieving the intended goal. Similarly, in classical machine learning, overfitting is what befalls a model when the training signal no longer captures the true underlying pattern. Goodhart’s Law is what happens to us, the humans designing the system, when our evaluation signal no longer reflects what we genuinely care about.

On my mind

What I find most fascinating about overfitting, particularly in RAG applications, is that it isn’t really a technical problem at all. It is fundamentally a problem of understanding and adhering to the right process. It is tempting to compromise that process and optimize directly for the scores, especially with RAG datasets that don’t look quite like the datasets we’re accustomed to in classical machine learning.

That said, this pattern surfaces far beyond machine learning and AI. In everyday life and in machine learning alike, the antidote is the same: remain consistent and never lose sight of the actual objective you are pursuing. In ML and AI, that objective is for the model to genuinely function and produce meaningful results once deployed and exposed to real-world data — not merely to achieve impressive scores during evaluation.

The team in our cautionary tale isn’t doing anything malicious. On the contrary, their actions feel responsible — carefully refining the application based on evaluation outcomes. And that is precisely what makes overfitting so dangerous. It doesn’t look like a mistake while it’s unfolding. It only appears as one in retrospect, once the system encounters the real world and the scores stop holding up.

✨ Thank you for reading! ✨

If you made it this far, you might find pialgorithms useful — a platform we’ve been building that helps teams securely manage organizational knowledge in one place.

Loved this post? Join me on 💌Substack and 💼LinkedIn

All images by the author, unless otherwise noted
