Bigger Context Isn't The Answer For RAG — Here's The System That Actually Solves It

TL;DR

I created a dataset Q&A tool and placed too much faith in a RAG response that was barely accurate.
I evaluated this across 7 different query types and 5 context sizes over 100,000 data rows.
Some query types abandoned RAG altogether.

I Trusted the Wrong Number

Last month I was deep into developing a new feature for EmiTechLogic. Learners can now upload their own messy CSV files and ask questions in plain English about their data. It seemed like the perfect use case for RAG, so I went all in — embeddings, retrieval, polished responses.

The initial demos looked fantastic. Clean tables, confident figures, professional formatting. I actually started relying on the system during our internal testing.

Then I decided to verify a single number.

Actual grocery spend in the dataset: $1,140,033.24.

The model delivered a neat breakdown by category. It appeared legitimate. I totaled the numbers it returned.

It came to less than half.

I sat there staring at the screen thinking “this can’t be right.” So I did what any engineer would do. I expanded the context window. 4k… 16k… 32k… 128k tokens. Each time the answer grew longer, more detailed, and more confidently incorrect.

That’s when it finally hit me. This wasn’t a retrieval problem. I was asking a retrieval system to do heavy computation on data it had only partially seen. And instead of admitting it was unsure or missing information, the model was generating polished, structured answers that appeared correct.

Why RAG Can’t Aggregate

The RAG pipeline doesn’t truly comprehend structured data. All it does is take each CSV row and convert it into plain text. That’s it. To the model, a row looks something like this:

"2019-01-01 grocery_pos 107.23 F NC Jennifer Banks ..."

For a query like “What is the total spend by category?”, the RAG pipeline works like this:

1. Tokenise: ["total", "spend", "category"]
2. Score all 100,000 rows by keyword overlap
3. Return the top-N rows as serialised plain text
4. Ask the LLM to sum and group from that text

Step 4 is where things fall apart. The LLM isn’t executing a SUM operation. It’s pattern-matching numbers from a text blob and producing a response that resembles an aggregation.

Models have difficulty with numerical precision at scale [1], but the real problem lies in the presentation. The model provides a detailed breakdown across all categories. This is a classic pitfall. The output looks professional. It replicates the structure of a genuine report so convincingly that your brain assumes the content is accurate. You have no way to confirm that 92% of your data is absent.

RAG is a retrieval mechanism. It’s not a computational engine. Retrieval locates relevant fragments. Computation demands a complete dataset scan. When you use RAG for math, you get an incorrect answer that appears authoritative. That difference matters enormously. A partial answer signals that information is missing. A complete-looking wrong answer simply signals false confidence.

Complete code:

The Benchmark: Two Pipelines, Same Query

To measure this precisely, I created a benchmark that runs two pipelines simultaneously for every query.

The first pipeline is a RAG simulation. It models what a basic vector pipeline sends to an LLM at five different context sizes. I tested five context sizes, ranging from 5 rows up to 8,000. That spans from 325 tokens to approximately 500,000. For each size, I monitored three metrics: how much data the LLM sees, what sum it computes from that particular slice, and whether a user could actually detect the error.

The second pipeline is a semantic engine that runs the same query as a deterministic full-scan across all 100,000 rows and delivers the exact correct answer.

Architectural comparison of query processing workflows, contrasting text-based RAG Simulation retrieval with structured data aggregation in a Semantic Engine. Image by Author.

The simulation doesn’t replicate exact LLM outputs. What it maintains is the key structural characteristic: a partial data slice fed into a system that returns a complete-form answer. That’s the characteristic responsible for the problem, and that’s what the benchmark evaluates.

I chose seven query types to cover every aggregation pattern a structured data system is likely to face:

Query	Operation	Why it breaks RAG
Total spend by category	SUM + GROUP BY	Requires summing all rows across 14 groups
Highest average transaction by category	AVG + GROUP BY	Average changes with every missing row
Total spent on grocery_pos	SUM + categorical filter	Filter requires seeing all matching rows
How many female customers transacted	COUNT + filter	Count is meaningless on a partial scan
Total spend where amount > $500	SUM + numeric comparison	Threshold logic requires full data
State with lowest total spending	MIN + GROUP BY across 50 groups	Minimum can only be found with all groups present
Percentage of transactions that are fraudulent	COUNT + ratio	Ratio is undefined on a partial denominator

These queries aren’t unusual or complex. They’re the standard questions any analyst asks when examining a new dataset. That’s precisely why this failure is so critical.

Error Observability Collapse

Here’s the complete benchmark output for the query that started all of this. I’m showing it in full because the numbers make the problem impossible to ignore.

GROUND TRUTH (Semantic Engine)
SUM(amt) GROUP BY category → 14 groups
  #1  grocery_pos               1,140,033.24
  #2  shopping_net                773,527.93
  #3  shopping_pos                725,766.14
  #4  gas_transport               648,804.24
  #5  home                        556,526.53
Latency: 100.47ms | Rows scanned: 100,000

RAG SIMULATION — what the LLM receives at each context size

Context               Rows   Coverage    Partial sum
Error detectable?
tiny   (~325 tokens)     5   0.0050%         197.73  EASY
small  (~3K tokens)     50   0.0500%       2,003.56  MODERATE
medium (~32K tokens)   500   0.5000%      31,023.21  HARD
large  (~130K tokens) 2,000  2.0000%     140,093.16  VERY HARD
xlarge (~520K tokens) 8,000  8.0000%     569,368.22  NEAR IMPOSSIBLE
I sat with these results for some time. What disturbed me most wasn't simply that the model's answers were incorrect — it was how dramatically harder those mistakes became to catch as the context window expanded.
With 8,000 rows, the error still exceeded 50%, yet the output resembled a polished professional report. You'd have to cross-check the figures by hand to realize something was wrong. This is what I began referring to as Error Observability Collapse: the larger the context fed into the model, the more authoritative — though not more accurate — its responses appeared.
"Partial sum" column represents what you'd get if the language model simply added up every amount field from the rows it actually managed to retrieve. The "Error detectable?" column rates how likely a human reader is to notice a mistake.
With 5 rows, the partial sum comes to 197.73. The real total is 1,140,033.24. It's blatantly obvious. The answer is short, the math is completely wrong, and missing data is glaring. The error practically announces itself.
At 8,000 rows, the partial sum reaches 569,368.22. By now the model has encountered all 14 categories. It crafts a 1,500-word report packed with specific figures and assured language. The error is 50%, yet it sits buried inside fluid, well-organized prose. Without any external data to compare against, a reader has no way to spot it.
This pattern remained consistent across all seven test queries:
Context Window Rows Dataset Coverage Response Length Error Detectable?
~325 tokens 5 0.005% ~50 words YES — clearly a guess
~3K tokens 50 0.050% ~150 words MAYBE
~32K tokens 500 0.500% ~400 words HARD
~130K tokens 2,000 2.000% ~800 words VERY HARD
~520K tokens 8,000 8.000% ~1,500 words NEAR IMPOSSIBLE
Semantic Engine 100,000 100% <200ms N/A — exact
I named this Error Observability Collapse. As the volume of context increases, the model's surface confidence grows in lockstep. Its actual correctness doesn't budge.
The illusion of context: How larger context windows in RAG and LLM systems raise user confidence and reduce error detectability without improving any real accuracy. Image by Author.
The two types of failures are asymmetric, and that asymmetry is what makes them treacherous:
An incorrect RAG response appears correct. It's well structured, precise, and stated with conviction. A failed computation, on the other hand, produces an explicit error message. It's visible and obvious.
One failure stays completely silent. The other is loud and unmistakable. As context windows stretch toward millions of tokens, that silent failure only gets harder to uncover [4]. The system doesn't become safer as it grows. It just becomes more persuasive.
The Semantic Engine: Proof That the Correct Answer Is Fast
Before I'd fully grasped the problem, I'd already thrown together a rudimentary semantic engine out of sheer frustration. I just wanted to get a correct answer — at least once.
The approach ended up being straightforward: deconstruct the query into proper operations and execute a single sweep across the entire dataset. No embeddings, no retrieval, no guesswork.
Here's how that plays out in practice:
The logic is simple. Consider a question like "What is the total spend by category?" The engine translates this directly into an operation: SUM(amt) GROUP BY category. It scans the full 100,000-row dataset in one clean pass. Running up cumulative grouped totals. There's no retrieval step. No inferential reasoning. No partial scanning. It hits every row exactly once, then delivers the precise result.
This shows that getting the right answer doesn't have to be costly. Benchmark queries finished in under 200ms. Sample size: 100,000 rows. Aggregation is a trivial operation. The failure occurs the moment you route those queries to a system designed to misinterpret them.
from context_window_engine import compute_ground_truth, load_csv

rows = load_csv("data/credit_card_transactions.csv", max_rows=100_000)

gt = compute_ground_truth(
    query_label = "total by category",
    rows        = rows,
    agg_func    = "sum",
    agg_col     = "amt",
    group_col   = "category",
)
# gt.answer     → [(grocery_pos, 1140033.24), (shopping_net, 773527.93), ...]
# gt.latency_ms → 100.47
The engine supports SUM, AVG, COUNT, MIN, and MAX. It handles both categorical and numeric filters. It includes GROUP BY and ratio operations. Zero external dependencies. Every operation behaves as a deterministic function applied to the entire list.
The engine itself isn't the product — it's the proof. The correct answer arrives in under one second. No inference needed. The real obstacle is finding a reliable way to route queries to it.
The Fix Is Not Better Retrieval
Stop trying to refine retrieval. If a query demands all the data, then an 8% sample will always fail. The real solution is to take retrieval out of the equation entirely.
What's needed is a classification layer. It sits in front of the pipeline and makes a single binary decision: does this query require computation, or does it need a lookup?
The distinction is straightforward. "Total spend by category" requires scanning the entire dataset. "Find transactions from Jennifer Banks" is a simple filtered lookup. Standard RAG forces both queries through the same pipeline. That's the fundamental design flaw.
A QueryRouter corrects this. It examines every query it receives and directs it to the right path — before retrieval ever starts.
Intent-based query routing architecture, separating analytical calculation intents from semantic information retrieval pipelines. Image by Author.
The classifier relies on three prioritized signal tiers. Tier 1 consists of aggregation verbs — total, how many, average, lowest, percentage. These signal that the query requires full-dataset computation.
Tier 2: numeric comparisons — greater than 500, above $1,000, at least. These call for filtering before aggregation, which RAG simply cannot handle. Tier 3: retrieval signals — find, show me, list, fetch. These point to lookups where semantic similarity does the job well.
Tier Signal Type Examples Routing Destination
1 Aggregation verb total, how many, average, lowest, percentage COMPUTATION
2 Numeric comparison greater than 500, above $1,000, at least COMPUTATION
3 Retrieval signal find, show me, list, fetch RETRIEVAL
0 No match ambiguous COMPUTATION — safer default
When no tier matches, fall back to COMPUTATION. This is an intentional choice. The failure modes are unbalanced: an incorrect RAG response to an aggregation question goes undetected silently. A computation engine that fails to parse a query raises a visible error. When uncertain, fail in a way that's obvious.
from query_router import QueryRouter

router = QueryRouter(rows)
result = router.route("What is the total spend by category?")
# result.routed_to     → "COMPUTATION"
# result.answer.answer → [(grocery_pos, 1140033.24), ...]
# result.total_latency → ~250ms — classify + execute combined

result = router.route("Find transactions from Jennifer Banks")
# result.routed_to     → "RETRIEVAL"
# result.answer.safe   → True — RAG is appropriate
# result.total_latency → ~110ms
Benchmark Routing Results
Nine queries were run through the router to validate performance across both routing paths: seven aggregation queries directed to the semantic computation engine, and two lookup queries routed to RAG.
Every single route was accurate. The seven aggregation queries were sent to the full-scan engine and delivered precise results. The two lookup queries were correctly dispatched to the RAG pipeline. Observe the output: strong confidence scores, accurate pattern recognition, and response times staying below 130ms — even processing the full 100,000-row dataset.
[1] ✓  COMPUTATION   "What is the total spend by category?"
     Tier 1 | matched='total' | confidence=0.97
     #1 grocery_pos      1,140,033.24  (102.57ms | 100,000 rows | exact)

[2] ✓  COMPUTATION   "Which category has the highest average transaction amount?"
     Tier 1 | matched='highest' | confidence=0.97
                               71.91  (119.47ms | 100,000 rows | exact)

[3] ✓  COMPUTATION   "What is the total amount spent on grocery_pos?"
     Tier 1 | matched='total' | confidence=0.97
                        1,140,033.24  (49.96ms  | 100,000 rows | exact)

[4] ✓  COMPUTATION   "How many transactions were made by female customers?"
     Tier 1 | matched='How many' | confidence=0.97
                           54,641.00  (90.45ms  | 100,000 rows | exact)

[5] ✓  COMPUTATION   "What is the total spend where amount is greater than 500?"
     Tier 1 | matched='total' | confidence=0.97
                        1,274,269.60  (91.65ms  | 100,000 rows | exact)

[6] ✓  COMPUTATION   "Which state has the lowest total spending?"
     Tier 1 | matched='lowest' | confidence=0.97
     lowest RI               2,125.60  (109.05ms | 100,000 rows | exact)

[7] ✓  COMPUTATION   "What percentage of transactions are fraudulent?"
     Tier 1 | matched='percentage' | confidence=0.97
                              0.9900%  (87.35ms  | 100,000 rows | exact)

[8] ✓  RETRIEVAL     "Find transactions from Jennifer Banks"
     Tier 3 | matched='Find' | confidence=0.85
     RAG is appropriate — no aggregation required

[9] ✓  RETRIEVAL     "Show me a sample transaction from Texas"
     Tier 3 | matched='Show me' | confidence=0.85
     RAG is appropriate — no aggregation required

Routing accuracy: 9/9
9/9 routed correctly. Error Observability Collapse is entirely prevented as long as aggregation queries never reach the RAG pipeline.
Comprehensive Test Coverage
The benchmark validates nine targeted queries. The test suite ensures dependability across a much wider set: edge cases, malformed inputs, missing values, and the common failure points seen in production environments.
The engine suite contains 87 tests spanning 10 test classes. It covers float parsing with currency symbols, commas, and scientific notation; all aggregation functions — sum, count, average, minimum maximum, and percentage — under both normal conditions and with empty inputs; all five numeric filter operators (greater than, less than, at least, at most, equal); full GROUP BY aggregation combining categorical and numerical filters; RAG simulation coverage at each context window size; and edge cases including empty datasets, rows with missing values, and single-row inputs.
The router suite contains 72 tests spanning 5 test classes. It covers all three tier patterns, including edge cases such as all-uppercase and very long queries; translation of natural language queries into typed operations for every supported format; routing and execution correctness tested against all seven benchmark aggregation queries; and a cross-validation suite that confirms router answers independently match ground-truth computations — verifying the router introduces zero deviation from the engine's own results.
To run the engine tests, type python -m unittest test_engine -v. This will execute all 87 tests in the engine suite.
To run the router tests, type python -m unittest test_router -v. This will execute all 72 tests in the router suite.
All 159 tests pass on Python 3.9+ with zero third-party dependencies.
Known Limitations
This approach isn't flawless. It currently operates only on single CSV files. Real-world production datasets are typically far messier, often spanning multiple tables that require joins — I intentionally narrowed the scope because I prioritized building something that actually worked end to end.
The router is also still relatively simple, relying on regular expression matching. I experimented with a lightweight LLM-based classifier early in development, but it proved inconsistent and added noticeable latency, so I reverted to the simpler approach. Occasionally the straightforward solution is the right one.
RAG responses were also simulated rather than generated by live API calls for this benchmark. The patterns should hold up, but actual results with GPT-4o or Claude 3.5 may vary slightly.
CSV format is mandatory. The engine reads structured data directly from CSV files. Database connections, Parquet files, and other structured formats are not currently supported.
Impact
Adding a routing layer costs almost nothing. Matching a query against 65 regex patterns takes mere microseconds. The semantic computation engine processes a 100,000-row dataset in under 200ms. The combined overhead is less than a single embedding API call.
What this buys you is a determinate answer for every aggregation query. Every total, every count, every percentage is computed from a complete scan rather than a confident-sounding estimate built on just 8 percent of the data. RAG continues to handle what it's genuinely suited for: fetching specific records, identifying relevant passages, and answering lookup-style questions where semantic similarity is the right mechanism for finding answers.
RAG is not broken.
Here's the paraphrased version with improved readability while preserving the exact same HTML structure and content meaning:
---
It is just being asked to compute, and it cannot do that.
The dangerous part is not that it fails. It is that it fails convincingly. And no amount of context changes that.
You can try typing it out like this:
First, clone the repository by running `git clone` followed by the URL . After completion, switch into the folder with `cd context-window-engine`. Then, launch the demo by executing `python demo.py` in your terminal.
References
[1] Levy, M., Jacoby, A., & Goldberg, Y. (2024). Same task, more tokens: The impact of input length on the reasoning performance of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15339–15353, Bangkok, Thailand. Association for Computational Linguistics.
[2] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
[3] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J.,
Guo, Q., Wang, M., & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
[4] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. 
[5] Koshorek, O., Granot, N., Alloni, A., Admati, S., Hendel, R., Weiss, I., Arazi, A., Cohen, S.-N., & Belinkov, Y. (2025). Structured RAG for answering aggregative questions. arXiv preprint arXiv:2511.08505.
Disclosure
All benchmark numbers come from real runs on Python 3.12.6, Windows 11, using only CPU (no GPU). The test data comes from the Credit Card Transactions Fraud Detection dataset (Kartik Gajjar, Kaggle, 2020), a synthetic set created with the Sparkov transaction simulator by Brandon Harris, under CC0 (Public Domain), available at kaggle.com/datasets/kartik2112/fraud-detection. The RAG baseline simulates retrieval and confidence scoring — no actual LLM API calls are made. No outside API keys are needed to recreate any result in this article. All code described was created and verified by me.

Top Posts

Polymarket vs Kalshi: The Battle for the FIFA World Cup Prediction Crown

Chinese Hackers Secretly Hijack Authentication Flow to Spy on Isolated Network for a Decade

SBA Launches Bold Audit to Scrutinize Economically Disadvantaged Contracting Firms

Bigger Context Isn’t the Answer for RAG — Here’s the System That Actually Solves It

`“Anthropic Yanks Claude Fable 5 and Mythos 5 Following Federal Directive”`

`Unlocking Claude Code’s Full Potential with Local Model Pairings`

`Insider Leak: Claude Fable 5 Quietly Curbed AI Researchers—And the Internet Exploded`

`AI-Powered Portfolio Trading: The Future of Automated Investing`

`Why Decade-Old Residual Connections Still Dominate AI—And Why That’s Holding Us Back`

`When PyMuPDF Misses the Table: Unlocking PDF Parsing for RAG with Azure Layout`

`Polymarket vs Kalshi: The Battle for the FIFA World Cup Prediction Crown`

`Chinese Hackers Secretly Hijack Authentication Flow to Spy on Isolated Network for a Decade`

`SBA Launches Bold Audit to Scrutinize Economically Disadvantaged Contracting Firms`

`Wearable Health Data Is Overwhelming Doctors — Here’s What Comes Next`

`Bigger Context Isn’t the Answer for RAG — Here’s the System That Actually Solves It`

`Moonshot AI Unveils Kimi K2.7-Code: A Leap Forward with +21.8% Boost on Kimi Code Bench v2 Over K2.6`

`Commerce Moves to Slash Nearly Half Its Web Presence in Sweeping Gov Reform Push`

`Hidden Dangers in At-Home DNA Test Kits: The Risks the Fine Print Won’t Tell You`

`Trending`

`Polymarket vs Kalshi: The Battle for the FIFA World Cup Prediction Crown`

`Chinese Hackers Secretly Hijack Authentication Flow to Spy on Isolated Network for a Decade`

`Latest Posts`

`Not More Data, but Better World Models – Unite.AI`

`OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears`

Subscribe to Updates

Top Posts

Bigger Context Isn’t the Answer for RAG — Here’s the System That Actually Solves It

TL;DR

I Trusted the Wrong Number

Why RAG Can’t Aggregate

The Benchmark: Two Pipelines, Same Query

Error Observability Collapse

The Semantic Engine: Proof That the Correct Answer Is Fast

The Fix Is Not Better Retrieval

Benchmark Routing Results

Comprehensive Test Coverage

Known Limitations

Impact

References

Related Posts

`Related Posts`