I created a dataset Q&A tool and placed too much faith in a RAG response that was barely accurate.
I evaluated this across 7 different query types and 5 context sizes over 100,000 data rows.
Some query types abandoned RAG altogether.
I Trusted the Wrong Number
Last month I was deep into developing a new feature for EmiTechLogic. Learners can now upload their own messy CSV files and ask questions in plain English about their data. It seemed like the perfect use case for RAG, so I went all in — embeddings, retrieval, polished responses.
The initial demos looked fantastic. Clean tables, confident figures, professional formatting. I actually started relying on the system during our internal testing.
Then I decided to verify a single number.
Actual grocery spend in the dataset: $1,140,033.24.
The model delivered a neat breakdown by category. It appeared legitimate. I totaled the numbers it returned.
It came to less than half.
I sat there staring at the screen thinking “this can’t be right.” So I did what any engineer would do. I expanded the context window. 4k… 16k… 32k… 128k tokens. Each time the answer grew longer, more detailed, and more confidently incorrect.
That’s when it finally hit me. This wasn’t a retrieval problem. I was asking a retrieval system to do heavy computation on data it had only partially seen. And instead of admitting it was unsure or missing information, the model was generating polished, structured answers that appeared correct.
Why RAG Can’t Aggregate
The RAG pipeline doesn’t truly comprehend structured data. All it does is take each CSV row and convert it into plain text. That’s it. To the model, a row looks something like this:
"2019-01-01 grocery_pos 107.23 F NC Jennifer Banks ..."
For a query like “What is the total spend by category?”, the RAG pipeline works like this:
1. Tokenise: ["total", "spend", "category"]
2. Score all 100,000 rows by keyword overlap
3. Return the top-N rows as serialised plain text
4. Ask the LLM to sum and group from that text
Step 4 is where things fall apart. The LLM isn’t executing a SUM operation. It’s pattern-matching numbers from a text blob and producing a response that resembles an aggregation.
Models have difficulty with numerical precision at scale [1], but the real problem lies in the presentation. The model provides a detailed breakdown across all categories. This is a classic pitfall. The output looks professional. It replicates the structure of a genuine report so convincingly that your brain assumes the content is accurate. You have no way to confirm that 92% of your data is absent.
RAG is a retrieval mechanism. It’s not a computational engine. Retrieval locates relevant fragments. Computation demands a complete dataset scan. When you use RAG for math, you get an incorrect answer that appears authoritative. That difference matters enormously. A partial answer signals that information is missing. A complete-looking wrong answer simply signals false confidence.
Complete code:
The Benchmark: Two Pipelines, Same Query
To measure this precisely, I created a benchmark that runs two pipelines simultaneously for every query.
The first pipeline is a RAG simulation. It models what a basic vector pipeline sends to an LLM at five different context sizes. I tested five context sizes, ranging from 5 rows up to 8,000. That spans from 325 tokens to approximately 500,000. For each size, I monitored three metrics: how much data the LLM sees, what sum it computes from that particular slice, and whether a user could actually detect the error.
The second pipeline is a semantic engine that runs the same query as a deterministic full-scan across all 100,000 rows and delivers the exact correct answer.
Architectural comparison of query processing workflows, contrasting text-based RAG Simulation retrieval with structured data aggregation in a Semantic Engine. Image by Author.
The simulation doesn’t replicate exact LLM outputs. What it maintains is the key structural characteristic: a partial data slice fed into a system that returns a complete-form answer. That’s the characteristic responsible for the problem, and that’s what the benchmark evaluates.
I chose seven query types to cover every aggregation pattern a structured data system is likely to face:
Query
Operation
Why it breaks RAG
Total spend by category
SUM + GROUP BY
Requires summing all rows across 14 groups
Highest average transaction by category
AVG + GROUP BY
Average changes with every missing row
Total spent on grocery_pos
SUM + categorical filter
Filter requires seeing all matching rows
How many female customers transacted
COUNT + filter
Count is meaningless on a partial scan
Total spend where amount > $500
SUM + numeric comparison
Threshold logic requires full data
State with lowest total spending
MIN + GROUP BY across 50 groups
Minimum can only be found with all groups present
Percentage of transactions that are fraudulent
COUNT + ratio
Ratio is undefined on a partial denominator
These queries aren’t unusual or complex. They’re the standard questions any analyst asks when examining a new dataset. That’s precisely why this failure is so critical.
Error Observability Collapse
Here’s the complete benchmark output for the query that started all of this. I’m showing it in full because the numbers make the problem impossible to ignore.
GROUND TRUTH (Semantic Engine)
SUM(amt) GROUP BY category → 14 groups
#1 grocery_pos 1,140,033.24
#2 shopping_net 773,527.93
#3 shopping_pos 725,766.14
#4 gas_transport 648,804.24
#5 home 556,526.53
Latency: 100.47ms | Rows scanned: 100,000
RAG SIMULATION — what the LLM receives at each context size
Context Rows Coverage Partial sum
Error detectable?
tiny (~325 tokens) 5 0.0050% 197.73 EASY
small (~3K tokens) 50 0.0500% 2,003.56 MODERATE
medium (~32K tokens) 500 0.5000% 31,023.21 HARD
large (~130K tokens) 2,000 2.0000% 140,093.16 VERY HARD
xlarge (~520K tokens) 8,000 8.0000% 569,368.22 NEAR IMPOSSIBLE
I sat with these results for some time. What disturbed me most wasn't simply that the model's answers were incorrect — it was how dramatically harder those mistakes became to catch as the context window expanded.
With 8,000 rows, the error still exceeded 50%, yet the output resembled a polished professional report. You'd have to cross-check the figures by hand to realize something was wrong. This is what I began referring to as Error Observability Collapse: the larger the context fed into the model, the more authoritative — though not more accurate — its responses appeared.
"Partial sum" column represents what you'd get if the language model simply added up every amount field from the rows it actually managed to retrieve. The "Error detectable?" column rates how likely a human reader is to notice a mistake.
With 5 rows, the partial sum comes to 197.73. The real total is 1,140,033.24. It's blatantly obvious. The answer is short, the math is completely wrong, and missing data is glaring. The error practically announces itself.
At 8,000 rows, the partial sum reaches 569,368.22. By now the model has encountered all 14 categories. It crafts a 1,500-word report packed with specific figures and assured language. The error is 50%, yet it sits buried inside fluid, well-organized prose. Without any external data to compare against, a reader has no way to spot it.
This pattern remained consistent across all seven test queries:
Context Window
Rows
Dataset Coverage
Response Length
Error Detectable?
~325 tokens
5
0.005%
~50 words
YES — clearly a guess
~3K tokens
50
0.050%
~150 words
MAYBE
~32K tokens
500
0.500%
~400 words
HARD
~130K tokens
2,000
2.000%
~800 words
VERY HARD
~520K tokens
8,000
8.000%
~1,500 words
NEAR IMPOSSIBLE
Semantic Engine
100,000
100%
<200ms
N/A — exact
I named this Error Observability Collapse. As the volume of context increases, the model's surface confidence grows in lockstep. Its actual correctness doesn't budge.
The illusion of context: How larger context windows in RAG and LLM systems raise user confidence and reduce error detectability without improving any real accuracy. Image by Author.
The two types of failures are asymmetric, and that asymmetry is what makes them treacherous:
An incorrect RAG response appears correct. It's well structured, precise, and stated with conviction. A failed computation, on the other hand, produces an explicit error message. It's visible and obvious.
One failure stays completely silent. The other is loud and unmistakable. As context windows stretch toward millions of tokens, that silent failure only gets harder to uncover [4]. The system doesn't become safer as it grows. It just becomes more persuasive.
The Semantic Engine: Proof That the Correct Answer Is Fast
Before I'd fully grasped the problem, I'd already thrown together a rudimentary semantic engine out of sheer frustration. I just wanted to get a correct answer — at least once.
The approach ended up being straightforward: deconstruct the query into proper operations and execute a single sweep across the entire dataset. No embeddings, no retrieval, no guesswork.
Here's how that plays out in practice:
The logic is simple. Consider a question like "What is the total spend by category?" The engine translates this directly into an operation: SUM(amt) GROUP BY category. It scans the full 100,000-row dataset in one clean pass. Running up cumulative grouped totals. There's no retrieval step. No inferential reasoning. No partial scanning. It hits every row exactly once, then delivers the precise result.
This shows that getting the right answer doesn't have to be costly. Benchmark queries finished in under 200ms. Sample size: 100,000 rows. Aggregation is a trivial operation. The failure occurs the moment you route those queries to a system designed to misinterpret them.
The engine supports SUM, AVG, COUNT, MIN, and MAX. It handles both categorical and numeric filters. It includes GROUP BY and ratio operations. Zero external dependencies. Every operation behaves as a deterministic function applied to the entire list.
The engine itself isn't the product — it's the proof. The correct answer arrives in under one second. No inference needed. The real obstacle is finding a reliable way to route queries to it.
The Fix Is Not Better Retrieval
Stop trying to refine retrieval. If a query demands all the data, then an 8% sample will always fail. The real solution is to take retrieval out of the equation entirely.
What's needed is a classification layer. It sits in front of the pipeline and makes a single binary decision: does this query require computation, or does it need a lookup?
The distinction is straightforward. "Total spend by category" requires scanning the entire dataset. "Find transactions from Jennifer Banks" is a simple filtered lookup. Standard RAG forces both queries through the same pipeline. That's the fundamental design flaw.
A QueryRouter corrects this. It examines every query it receives and directs it to the right path — before retrieval ever starts.
Intent-based query routing architecture, separating analytical calculation intents from semantic information retrieval pipelines. Image by Author.
The classifier relies on three prioritized signal tiers. Tier 1 consists of aggregation verbs — total, how many, average, lowest, percentage. These signal that the query requires full-dataset computation.
Tier 2: numeric comparisons — greater than 500, above $1,000, at least. These call for filtering before aggregation, which RAG simply cannot handle. Tier 3: retrieval signals — find, show me, list, fetch. These point to lookups where semantic similarity does the job well.
Tier
Signal Type
Examples
Routing Destination
1
Aggregation verb
total, how many, average, lowest, percentage
COMPUTATION
2
Numeric comparison
greater than 500, above $1,000, at least
COMPUTATION
3
Retrieval signal
find, show me, list, fetch
RETRIEVAL
0
No match
ambiguous
COMPUTATION — safer default
When no tier matches, fall back to COMPUTATION. This is an intentional choice. The failure modes are unbalanced: an incorrect RAG response to an aggregation question goes undetected silently. A computation engine that fails to parse a query raises a visible error. When uncertain, fail in a way that's obvious.
from query_router import QueryRouter
router = QueryRouter(rows)
result = router.route("What is the total spend by category?")
# result.routed_to → "COMPUTATION"
# result.answer.answer → [(grocery_pos, 1140033.24), ...]
# result.total_latency → ~250ms — classify + execute combined
result = router.route("Find transactions from Jennifer Banks")
# result.routed_to → "RETRIEVAL"
# result.answer.safe → True — RAG is appropriate
# result.total_latency → ~110ms
Benchmark Routing Results
Nine queries were run through the router to validate performance across both routing paths: seven aggregation queries directed to the semantic computation engine, and two lookup queries routed to RAG.
Every single route was accurate. The seven aggregation queries were sent to the full-scan engine and delivered precise results. The two lookup queries were correctly dispatched to the RAG pipeline. Observe the output: strong confidence scores, accurate pattern recognition, and response times staying below 130ms — even processing the full 100,000-row dataset.
[1] ✓ COMPUTATION "What is the total spend by category?"
Tier 1 | matched='total' | confidence=0.97
#1 grocery_pos 1,140,033.24 (102.57ms | 100,000 rows | exact)
[2] ✓ COMPUTATION "Which category has the highest average transaction amount?"
Tier 1 | matched='highest' | confidence=0.97
71.91 (119.47ms | 100,000 rows | exact)
[3] ✓ COMPUTATION "What is the total amount spent on grocery_pos?"
Tier 1 | matched='total' | confidence=0.97
1,140,033.24 (49.96ms | 100,000 rows | exact)
[4] ✓ COMPUTATION "How many transactions were made by female customers?"
Tier 1 | matched='How many' | confidence=0.97
54,641.00 (90.45ms | 100,000 rows | exact)
[5] ✓ COMPUTATION "What is the total spend where amount is greater than 500?"
Tier 1 | matched='total' | confidence=0.97
1,274,269.60 (91.65ms | 100,000 rows | exact)
[6] ✓ COMPUTATION "Which state has the lowest total spending?"
Tier 1 | matched='lowest' | confidence=0.97
lowest RI 2,125.60 (109.05ms | 100,000 rows | exact)
[7] ✓ COMPUTATION "What percentage of transactions are fraudulent?"
Tier 1 | matched='percentage' | confidence=0.97
0.9900% (87.35ms | 100,000 rows | exact)
[8] ✓ RETRIEVAL "Find transactions from Jennifer Banks"
Tier 3 | matched='Find' | confidence=0.85
RAG is appropriate — no aggregation required
[9] ✓ RETRIEVAL "Show me a sample transaction from Texas"
Tier 3 | matched='Show me' | confidence=0.85
RAG is appropriate — no aggregation required
Routing accuracy: 9/9
9/9 routed correctly. Error Observability Collapse is entirely prevented as long as aggregation queries never reach the RAG pipeline.
Comprehensive Test Coverage
The benchmark validates nine targeted queries. The test suite ensures dependability across a much wider set: edge cases, malformed inputs, missing values, and the common failure points seen in production environments.
The engine suite contains 87 tests spanning 10 test classes. It covers float parsing with currency symbols, commas, and scientific notation; all aggregation functions — sum, count, average, minimum maximum, and percentage — under both normal conditions and with empty inputs; all five numeric filter operators (greater than, less than, at least, at most, equal); full GROUP BY aggregation combining categorical and numerical filters; RAG simulation coverage at each context window size; and edge cases including empty datasets, rows with missing values, and single-row inputs.
The router suite contains 72 tests spanning 5 test classes. It covers all three tier patterns, including edge cases such as all-uppercase and very long queries; translation of natural language queries into typed operations for every supported format; routing and execution correctness tested against all seven benchmark aggregation queries; and a cross-validation suite that confirms router answers independently match ground-truth computations — verifying the router introduces zero deviation from the engine's own results.
To run the engine tests, type python -m unittest test_engine -v. This will execute all 87 tests in the engine suite.
To run the router tests, type python -m unittest test_router -v. This will execute all 72 tests in the router suite.
All 159 tests pass on Python 3.9+ with zero third-party dependencies.
Known Limitations
This approach isn't flawless. It currently operates only on single CSV files. Real-world production datasets are typically far messier, often spanning multiple tables that require joins — I intentionally narrowed the scope because I prioritized building something that actually worked end to end.
The router is also still relatively simple, relying on regular expression matching. I experimented with a lightweight LLM-based classifier early in development, but it proved inconsistent and added noticeable latency, so I reverted to the simpler approach. Occasionally the straightforward solution is the right one.
RAG responses were also simulated rather than generated by live API calls for this benchmark. The patterns should hold up, but actual results with GPT-4o or Claude 3.5 may vary slightly.
CSV format is mandatory. The engine reads structured data directly from CSV files. Database connections, Parquet files, and other structured formats are not currently supported.
Impact
Adding a routing layer costs almost nothing. Matching a query against 65 regex patterns takes mere microseconds. The semantic computation engine processes a 100,000-row dataset in under 200ms. The combined overhead is less than a single embedding API call.
What this buys you is a determinate answer for every aggregation query. Every total, every count, every percentage is computed from a complete scan rather than a confident-sounding estimate built on just 8 percent of the data. RAG continues to handle what it's genuinely suited for: fetching specific records, identifying relevant passages, and answering lookup-style questions where semantic similarity is the right mechanism for finding answers.
RAG is not broken.
Here's the paraphrased version with improved readability while preserving the exact same HTML structure and content meaning:
---
It is just being asked to compute, and it cannot do that. The dangerous part is not that it fails. It is that it fails convincingly. And no amount of context changes that.
You can try typing it out like this:
First, clone the repository by running `git clone` followed by the URL . After completion, switch into the folder with `cd context-window-engine`. Then, launch the demo by executing `python demo.py` in your terminal.
References
[1] Levy, M., Jacoby, A., & Goldberg, Y. (2024). Same task, more tokens: The impact of input length on the reasoning performance of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15339–15353, Bangkok, Thailand. Association for Computational Linguistics.
[2] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
[3] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
[4] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173.
[5] Koshorek, O., Granot, N., Alloni, A., Admati, S., Hendel, R., Weiss, I., Arazi, A., Cohen, S.-N., & Belinkov, Y. (2025). Structured RAG for answering aggregative questions. arXiv preprint arXiv:2511.08505.
Disclosure
All benchmark numbers come from real runs on Python 3.12.6, Windows 11, using only CPU (no GPU). The test data comes from the Credit Card Transactions Fraud Detection dataset (Kartik Gajjar, Kaggle, 2020), a synthetic set created with the Sparkov transaction simulator by Brandon Harris, under CC0 (Public Domain), available at kaggle.com/datasets/kartik2112/fraud-detection. The RAG baseline simulates retrieval and confidence scoring — no actual LLM API calls are made. No outside API keys are needed to recreate any result in this article. All code described was created and verified by me.