RAG Isn’t Enough: I Built the Missing Context Layer That Makes LLM Systems Work

TL;DR

A full working implementation in pure Python, with real benchmark numbers.

RAG systems break when context grows beyond a few turns.

The real problem isn’t retrieval; it’s what actually enters the context window.

A context engine controls memory, compression, re-ranking, and token limits explicitly.

This isn’t a concept. It is a working system with measurable behavior.

The Breaking Point of RAG Systems

I built a RAG system that worked perfectly, until it didn’t.

The moment I added conversation history, everything started breaking. Relevant documents were getting dropped. The prompt overflowed. The model started forgetting things it had said two turns ago. Not because retrieval failed. Not because the prompt was badly written. But because I had zero control over what actually entered the context window.

That’s the problem nobody talks about. Most RAG tutorials stop at: retrieve some documents, stuff them into a prompt, call the model. What happens when your retrieved context is 6,000 characters but your remaining budget is 1,800? What happens when three of your five retrieved documents are near-duplicates, crowding out the one useful one? What happens when turn one of a twenty-turn conversation is still sitting in the prompt, taking up space, long after it stopped being relevant?

These aren’t rare edge cases. This is what happens by default, and it starts breaking within the first few turns.

All results below are from real runs of the system (Python 3.12, CPU-only, no GPU), except where noted as calculated.

The answer is a layer most tutorials skip entirely. Between raw retrieval and prompt construction, there’s a deliberate architectural step: deciding what the model actually sees, how much of it, and in what order. In 2025, Andrej Karpathy gave this a name: context engineering [2]. I’d been building it for months without calling it that.

This is the system I built, from retrieval to memory to compression, with real numbers and code you can run.

Full code:

What Context Engineering Actually Is

It’s worth being precise, because the terms get muddled.

Prompt engineering is the craft of what you say to the model: your system prompt, your few-shot examples, your output format instructions. It shapes how the model reasons.

RAG is a technique for fetching relevant external documents and including them before generation. It grounds the model in facts it wasn’t trained on [1].

Context engineering is the layer in between: the architectural decisions about what information flows into the context window, how much of it, and in what form. It answers: given everything that could go into this prompt, what should actually go in?

All three are complementary. In a well-designed system they every have a definite job.

Who This Is For

This architecture is worth building if you’re working on multi-turn chatbots where context accumulates across turns, RAG systems with large knowledge bases where retrieval noise is a real problem, or AI copilots and agents that need memory to stay coherent.

Skip it for single-turn queries with a small knowledge base: the pipeline overhead doesn’t justify a marginal quality gain. Skip it for latency-critical services under 50ms: embedding generation alone adds ~85ms on CPU. Skip it for fully deterministic domains like legal contract analysis, where keyword-only retrieval is often sufficient and more auditable.

If you have unlimited context windows and unlimited latency, plain RAG works fine. In production, neither exists.

Full Pipeline Architecture

A complete context engineering pipeline for RAG systems, combining retrieval, memory management, compression, and token budget control to build efficient and scalable LLM applications. Image by Author.

Component 1: The Retriever

Most RAG implementations pick one retrieval method and call it done. The problem is that no single method dominates across all query types. Keyword matching is fast and precise for exact terms. TF-IDF handles term weighting. Dense vector embeddings catch semantic relationships that keywords miss entirely.

Keyword vs. TF-IDF: Same Query, Different Behavior

For the question: “how does memory work in AI agents”

Both methods agree on mem-001 as the top document. But there’s a crucial difference: TF-IDF provides more nuanced scoring by weighting term rarity, while keyword retrieval only counts raw overlap. On this query they converge, but they diverge badly on conceptual queries with different wording. This is precisely why hybrid retrieval becomes necessary.

The Retriever supports three modes: keyword, tfidf, and hybrid. Hybrid mode runs both TF-IDF and embedding retrieval and blends their scores with a single tunable weight:

hybrid_score = alpha * emb_score + (1 - alpha) * tf_score

The alpha=0.65 default weights embeddings slightly more than TF-IDF. It is empirical, not principled, but tested across different query styles. Keyword-heavy queries perform better around alpha=0.4; paraphrase-style queries benefit from alpha=0.8 or higher.
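As a rough illustration of the blend, here is a minimal sketch. The function names `hybrid_scores` and `top_k` and the example scores are assumptions made for this illustration, not taken from the repository; the only part that comes from the article is the formula and the 0.65 default.

```python
def hybrid_scores(emb_scores, tf_scores, alpha=0.65):
    """Blend per-document scores: alpha * emb_score + (1 - alpha) * tf_score.

    Both arguments map doc_id -> score. A document missing from one
    retriever is treated as scoring 0.0 there (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    doc_ids = set(emb_scores) | set(tf_scores)
    return {
        doc_id: alpha * emb_scores.get(doc_id, 0.0)
        + (1 - alpha) * tf_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def top_k(scores, k=4):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical scores, chosen only to show a document that TF-IDF never
# returned (tfidf-001) surfacing through its embedding score.
emb = {"mem-001": 0.70, "vec-001": 0.55, "tfidf-001": 0.60}
tf = {"mem-001": 0.42, "vec-001": 0.29, "ctx-001": 0.20}
ranked = top_k(hybrid_scores(emb, tf))
```

Lowering alpha toward 0.4 shifts weight back to TF-IDF, which in this toy data would pull ctx-001 up and tfidf-001 down.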

What Hybrid Retrieval Fixes That TF-IDF Misses

For the question: “how do embeddings compare to TF-IDF for memory in AI agents”

Mode	Documents Retrieved	Why
TF-IDF	mem-001, vec-001, ctx-001	Only keyword-overlapping documents surface
Hybrid	mem-001, vec-001, tfidf-001, ctx-001	Conceptually related tfidf-001 now surfaces

tfidf-001 doesn’t appear in the TF-IDF results because it shares few query tokens. Hybrid mode surfaces it because the embedding recognises its conceptual relevance. This is the exact failure mode of traditional RAG at scale.

One implementation note: sentence-transformers is optional. Without it, the system falls back to random embeddings with a warning. Production gets real semantics; development gets a functional stub.

Component 2: The Re-ranker

Retrieval gives you candidates. Re-ranking decides the final order.

The re-ranker applies a two-factor weighted sum that combines the retrieval score with a tag-based importance value. Documents tagged with memory, context, rag, or embedding receive a tag_importance of 1.4; all others receive 1.0. Both feed into the same formula:

final_score = base_score * 0.68 + tag_importance * 0.32

A tagged document with tag_importance=1.4 contributes 0.448 from that term alone, versus 0.32 for an untagged one: a fixed bonus of 0.128 regardless of retrieval score. The weights reflect a specific prior: retrieval signal is primary, domain relevance is a meaningful secondary signal.

Scores Before and After Re-ranking

Document	Before Re-ranking	After Re-ranking	Change
mem-001	0.4161	0.7309	+75.7%
rag-001	outside top 4	0.5280	promoted
vec-001	0.2880	0.5158	+79.1%
tfidf-001	0.2164	0.4672	+115.9%

rag-001 jumps from outside the top 4 to second place solely because of its tag boost. These reorderings change which documents survive compression; they’re not cosmetic.
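The formula can be checked against the table with a few lines of Python. This is a sketch, not the repository's `_rerank()`: the function names are made up for illustration, and which documents carry a boosted tag is inferred from the published scores (mem-001 matches the 1.4 case, vec-001 and tfidf-001 match the 1.0 case). rag-001 is left out because its pre-rerank score is not given.

```python
BOOSTED_TAGS = {"memory", "context", "rag", "embedding"}


def tag_importance(tags):
    """1.4 if the document carries any boosted tag, otherwise 1.0."""
    return 1.4 if BOOSTED_TAGS & set(tags) else 1.0


def final_score(base_score, tags):
    """final_score = base_score * 0.68 + tag_importance * 0.32"""
    return base_score * 0.68 + tag_importance(tags) * 0.32


def rerank(candidates):
    """candidates: iterable of (doc_id, base_score, tags); best first."""
    scored = [(doc_id, final_score(base, tags)) for doc_id, base, tags in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Base scores from the table; tag assignments are assumptions (see lead-in).
candidates = [
    ("vec-001", 0.2880, []),
    ("mem-001", 0.4161, ["memory"]),
    ("tfidf-001", 0.2164, []),
]
reranked = [(doc_id, round(score, 4)) for doc_id, score in rerank(candidates)]
```

Because the tag term is additive, a tagged document with a weak retrieval score can still overtake an untagged one, which is the behaviour the rag-001 row shows.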

Is the heuristic principled? Not entirely. A cross-encoder re-ranker, which scores each query-document pair with a neural model [7], would be more accurate. But cross-encoders cost one model call per document. At 5 documents, the heuristic runs in microseconds. At 500+, a cross-encoder becomes worth the cost.

Component 3: Memory with Exponential Decay

This is the component most tutorials leave out entirely, and the one where naive systems collapse fastest.

Conversational memory has two failure modes: forgetting too fast (losing context that’s still relevant) and forgetting too slowly (accumulating noise that crowds out useful information). A sliding window drops old turns abruptly: turn 10 is fully present, turn 11 is gone. That’s not how useful information works.

The solution is exponential decay, where turns fade continuously based on three factors.

The scoring formula:

effective = importance * recency * freshness + relevance_boost

Where each term is:

recency = e^(−decay_rate × age_seconds): older turns carry less weight
freshness = e^(−0.01 × time_since_last_access): recently referenced turns get a boost
relevance_boost = (|query ∩ turn| / |query|) × 0.35: turns with high query-token overlap are retained longer

This mirrors how working memory actually prioritises information [4]: high-importance turns survive longer; off-topic turns fade quickly regardless of when they occurred.
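The scoring formula above translates almost directly into code. The sketch below makes several assumptions that the article does not state: `time_since_last_access` is measured in seconds, tokens are lowercase alphanumeric runs, and `decay_rate` defaults to 0.001. The function and parameter names are illustrative only.

```python
import math
import re


def _tokens(text):
    """Lowercase alphanumeric tokens as a set (tokenisation is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def effective_score(importance, age_seconds, idle_seconds, query, turn_text,
                    decay_rate=0.001):
    """effective = importance * recency * freshness + relevance_boost"""
    recency = math.exp(-decay_rate * age_seconds)    # older turns weigh less
    freshness = math.exp(-0.01 * idle_seconds)       # recently accessed turns weigh more
    query_tokens = _tokens(query)
    overlap = (
        len(query_tokens & _tokens(turn_text)) / len(query_tokens)
        if query_tokens
        else 0.0
    )
    relevance_boost = overlap * 0.35                 # share of query tokens in the turn
    return importance * recency * freshness + relevance_boost


# A brand-new turn scores its raw importance; full query overlap adds 0.35.
fresh_off_topic = effective_score(2.5, 0, 0, "weather today", "memory decay")
fresh_on_topic = effective_score(1.0, 0, 0, "memory decay", "Memory decay explained")
```

Note that the relevance boost is added after the decay product, so an old turn that matches the current query can still clear the retention threshold.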

Auto-Importance Scoring

Auto-importance scoring makes this practical without manual annotation. The system scores each turn based on content length, domain keywords, and query overlap:

Turn Content	Role	Auto-Scored Importance
“What is context engineering and why is it important?”	user	2.33
“Explain how memory decay prevents context bloat.”	user	2.50
“What is the weather in Chennai today?”	user	1.10

A weather question scores 1.10, barely above the floor. A domain question about memory decay scores 2.50 and survives far longer before decaying. In a long conversation, high-importance domain turns stay in memory while low-importance small-talk turns fade first, which is exactly the ordering you want.

Deduplication

Deduplication runs before any turn is stored, as a three-tier check: exact containment (if the new turn is a substring of an existing one, reject), strong prefix overlap (if the first half of both turns match, reject), and token-overlap similarity >= 0.72 (if token overlap is high enough, reject as a paraphrase).

At 0.72, you catch paraphrases without falsely rejecting related-but-distinct questions on the same topic. A follow-up like “Can you explain context engineering and its role in RAG?” after “What is context engineering and how does it help RAG systems?” scores ~72% overlap: deduplication fires, one memory slot is saved, and room is made for genuinely new information.
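A three-tier check along these lines could look like the sketch below. The article does not say how token-overlap similarity is computed, so this version assumes shared tokens divided by the size of the smaller token set; with that measure the article's own example pair will not necessarily land at 72%. The `min_prefix` guard and all names are also assumptions added for illustration.

```python
import re


def _norm(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_duplicate(new_turn, existing_turn, threshold=0.72, min_prefix=12):
    """Return True if new_turn should be rejected as a duplicate."""
    new, old = _norm(new_turn), _norm(existing_turn)

    # Tier 1: exact containment (an empty turn also counts as contained).
    if new in old:
        return True

    # Tier 2: strong prefix overlap. Compare the first half of the shorter
    # turn; min_prefix is a hypothetical guard so very short strings do not
    # match on a single character.
    half = min(len(new), len(old)) // 2
    if half >= min_prefix and new[:half] == old[:half]:
        return True

    # Tier 3: token overlap relative to the smaller token set (assumed measure).
    a, b = _tokens(new), _tokens(old)
    if a and b:
        return len(a & b) / min(len(a), len(b)) >= threshold
    return False


def should_store(new_turn, stored_turns):
    """A turn is stored only if it duplicates none of the existing turns."""
    return not any(is_duplicate(new_turn, old) for old in stored_turns)
```

The tiers are ordered cheapest first, so most rejections never reach tokenisation.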

Token Budget Under Pressure

How the token budget is distributed across turns in a context-aware RAG system, balancing the system prompt, conversation history, and retrieved documents under dynamic compression. Image by Author.

Component 4: Context Compression

You have 810 characters of retrieved context. Your remaining budget allows 800 characters. That 10-character gap means something either gets truncated badly or the whole thing overflows.

The Compressor implements three strategies. Truncate is the fastest: it cuts each chunk proportionally. Sentence uses greedy sentence-boundary selection. Extractive is query-aware: every sentence across all retrieved documents gets scored by token overlap with the query, ranked by relevance, and greedily selected within budget. Then the selected sentences are served back in their original document order, not relevance rank order [5]. Relevance rank order produces incoherent context. Original order preserves the logical flow of the source material.
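To make the extractive strategy concrete, here is a minimal sketch of the score, select, and reorder steps. It is not the repository's `_extractive()`: the sentence splitter, the tokeniser, the tie-breaking rule, and the function names are all assumptions chosen to keep the example dependency-free.

```python
import re


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_compress(docs, query, max_chars):
    """Pick the most query-relevant sentences that fit, in original order."""
    query_tokens = _tokens(query)
    candidates = []  # (original position, sentence, score)
    for doc in docs:
        for sentence in split_sentences(doc):
            score = (
                len(query_tokens & _tokens(sentence)) / len(query_tokens)
                if query_tokens
                else 0.0
            )
            candidates.append((len(candidates), sentence, score))

    # Rank by relevance; ties keep earlier sentences first (an assumption).
    ranked = sorted(candidates, key=lambda c: (-c[2], c[0]))

    chosen, used = [], 0
    for position, sentence, _ in ranked:
        cost = len(sentence) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost <= max_chars:
            chosen.append((position, sentence))
            used += cost

    chosen.sort()  # restore original document order
    return " ".join(sentence for _, sentence in chosen)


docs = [
    "Cats sleep a lot. Memory decay removes stale turns.",
    "Embeddings capture meaning. Decay keeps memory small.",
]
compressed = extractive_compress(docs, "decay keeps memory small", max_chars=60)
```

In this example the last sentence ranks first on relevance, yet the output still presents the two selected sentences in the order they appeared in the source documents.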

Compression Strategy Trade-offs: Same 810-Character Input, 800-Character Budget

Strategy	Output Size	Compression Ratio	What It Optimises
Truncate	744 chars	91.9%	Speed
Sentence	684 chars	84.4%	Clean boundaries
Extractive	762 chars	94.1%	Relevance

Extractive compression preserves meaning better, but saves fewer raw characters. Under tight budgets, it gives you the right content, not just less content.

Component 5: The Token Budget Enforcer

Everything feeds into the TokenBudget, a slot-based allocator that tracks usage across named context regions. Token estimation uses the 1 token ≈ 4 characters heuristic for English prose, consistent with OpenAI’s documentation [6].

The order of reservation is the whole design:

def build(self, query: str) -> ContextPacket:
    budget = TokenBudget(total=self.total_token_budget)
    budget.reserve_text("system_prompt", self.system_prompt)          # 1. Fixed

    # Retrieve and re-rank candidates before any document space is allocated.
    scored_docs = self._rerank(self._retriever.retrieve(query, ...), query)

    # History is reserved before documents so it is never squeezed out.
    memory_turns = self._memory.get_weighted(query=query)
    budget.reserve_text("history", " ".join(t.content for t in memory_turns))  # 2. Reserved

    # Documents are compressed to fit whatever budget is left.
    remaining_chars = budget.remaining_chars()
    compressor = Compressor(max_chars=remaining_chars, strategy=self.compression_strategy)
    result = compressor.compress([sd.document.content for sd in scored_docs], query=query)

    budget.reserve_text("retrieved_docs", result.text)                # 3. What's left
    return ContextPacket(...)

The system prompt is fixed overhead you can’t negotiate away. Memory is what makes multi-turn conversation coherent. Documents are the variable: valuable, but the first thing to compress when space runs out. Reserve in the wrong order and documents silently overflow the budget before history is even accounted for. The orchestrator enforces the right order explicitly.
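For readers who want to see what a slot-based allocator might look like, here is a small stand-in. It mirrors the method names used in the excerpt above (`reserve_text`, `remaining_chars`), but everything else, including rounding up the token estimate and raising on overflow, is an assumption of this sketch rather than the repository's behaviour.

```python
import math


class TokenBudget:
    """Hypothetical slot-based allocator using the 1 token ~ 4 chars heuristic."""

    CHARS_PER_TOKEN = 4

    def __init__(self, total):
        self.total = total
        self.slots = {}  # slot name -> reserved tokens

    @classmethod
    def estimate_tokens(cls, text):
        """Estimate tokens from character count, rounding up."""
        return math.ceil(len(text) / cls.CHARS_PER_TOKEN)

    def used(self):
        return sum(self.slots.values())

    def remaining_tokens(self):
        return self.total - self.used()

    def remaining_chars(self):
        return max(0, self.remaining_tokens()) * self.CHARS_PER_TOKEN

    def reserve_text(self, slot, text):
        """Reserve space for text under a named slot; fail loudly on overflow."""
        if slot in self.slots:
            raise ValueError(f"slot {slot!r} is already reserved")
        cost = self.estimate_tokens(text)
        if cost > self.remaining_tokens():
            raise ValueError(f"slot {slot!r} needs {cost} tokens, "
                             f"only {self.remaining_tokens()} left")
        self.slots[slot] = cost
        return cost


budget = TokenBudget(total=800)
budget.reserve_text("system_prompt", "x" * 800)   # 200 tokens
budget.reserve_text("history", "y" * 400)         # 100 tokens
chars_left_for_docs = budget.remaining_chars()    # (800 - 300) * 4
```

Reserving in this order means the compressor is handed a character limit that already accounts for the prompt and the history, so documents can only shrink, never push the other slots out.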

What Happens Under Real Token Pressure

This is where naive systems fail, and this engine adapts.

Setup: 5 documents (810 chars total), 200 tokens reserved for system prompt, 800-token total budget. Query: “How do embeddings and TF-IDF compare for memory in agents?”

Turn 1 (no conversation history yet): Documents retrieved: 5, re-ranked. Memory turns: 0. Compression applied: 48% reduction. Result: fits within budget.

Turn 2 (after conversation begins): Documents retrieved: 5, re-ranked. Memory turns: 2, now competing for space. Compression becomes more aggressive: 45% reduction. Result: still fits within budget.

What changed? The system didn’t fail; it adapted. Memory turns consumed part of the budget, so compression on retrieved documents tightened automatically. That’s the point of context engineering: the model always receives something coherent, never a random overflow.

Measuring What It Actually Buys You

The table below compares four approaches on the same query and 800-token total budget. The first three rows are calculated from known inputs using the same 810-character document set; the fourth row reflects actual engine output verified against demo runs.

Approach	Docs Retrieved	After Compression	Memory	Fits Budget?
Naive RAG	5 (full)	810 chars, none	None	No, 10 chars over
RAG + Truncate	5	360 chars (43%)	None	Yes, but tail content lost
RAG + Memory (no decay)	5 (full)	810 chars	3 turns, unfiltered	No, history pushes it over
Full Context Engine	5, reranked	400 chars (50%)	2 turns, decay-filtered	Yes, all constraints met

Naive RAG overflows immediately. Truncation fits but blindly cuts the tail. Memory without decay adds noise rather than signal: older turns never fade, and conversation history becomes bloat. The full system re-ranks, compresses intelligently, and includes only turns that still carry information.

Memory Decay by Importance Score

Effective score decay over 24 hours (decay_rate 0.001, min_importance threshold 0.1). The high-importance turns (importance 2.50, the context bloat explanation, and importance 2.33, the context engineering query) survive the full session window, while low-importance turns like the weather query (importance 1.10) fall below the 0.1 threshold at ~12 hr and are dropped. A relevance boost from query-token overlap can briefly revive aged turns.

Performance Characteristics

Measured on Python 3.12.6, CPU only, no GPU, 5-document knowledge base:

Operation	Latency	Notes
Keyword retrieval	~0.8ms	Simple token matching
TF-IDF retrieval	~2.1ms	Vectorisation + cosine similarity
Hybrid retrieval	~85ms	Embedding generation dominates
Re-ranking (5 docs)	~0.3ms	Tag-weighted scoring
Memory decay + filtering	~0.6ms	Exponential decay calculation
Compression (extractive)	~4.2ms	Sentence scoring + selection
Full `engine.build()`	~92ms	Hybrid mode dominates

Hybrid retrieval is the bottleneck. If you need sub-50ms response time, use TF-IDF or keyword mode instead. At 100 requests/sec in hybrid mode you need roughly 9 concurrent workers; with embedding caching, subsequent queries drop to ~2ms per request after the first.
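The worker estimate follows from Little's law (average requests in flight = arrival rate × average latency). A quick sketch of the arithmetic, using only the latencies from the table; the function name is illustrative, and the calculation assumes each worker handles one request at a time with no queueing headroom:

```python
import math


def mean_in_flight(requests_per_sec, latency_ms):
    """Little's law: average concurrent requests = rate * latency."""
    return requests_per_sec * latency_ms / 1000.0


hybrid = mean_in_flight(100, 92)   # about 9.2 requests in flight on average
cached = mean_in_flight(100, 2)    # about 0.2 once embeddings are cached

# Rounding up gives a whole number of workers with a little slack.
hybrid_workers = math.ceil(hybrid)
cached_workers = math.ceil(cached)
```

The average of about 9.2 is where the "roughly 9" figure comes from; in practice you would provision at least the rounded-up value, and more if traffic is bursty.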

Honest Design Decisions

alpha=0.65 is empirical, not principled. I tested across a small query set from my knowledge base. For a different domain (legal documents, medical literature, dense code) the right alpha will be different. Keyword-heavy queries do better around 0.4; conceptual or paraphrased queries benefit from 0.8 or higher.

The re-ranking weights (0.68/0.32) are a heuristic. A cross-encoder re-ranker would be more principled [7] but costs one model call per document. For 5 documents, the heuristic runs in microseconds. For 500+ documents, a cross-encoder becomes worth the cost.

Token estimation (1 token ≈ 4 chars) is an approximation. It is within ~15% of actual token counts for English prose [6], but misfires for code and non-Latin scripts. For production, swap in tiktoken [8]; it’s a one-line change in compressor.py.

The extractive compressor scores by query-token recall overlap: how many query tokens appear in the sentence, as a fraction of the query length. This is fast and dependency-free but misses semantic similarity: a sentence that paraphrases the query without sharing any tokens scores zero. Embedding-based sentence scoring would fix that at the cost of an additional model call per compression pass.

Trade-offs and What’s Missing

Cross-encoder re-ranking. The _rerank() interface is already designed to be swapped out. Drop in a BERT-based cross-encoder for meaningfully better pair-wise rankings.

Embedding-based compression. Replace the token-overlap sentence scorer in _extractive() with a small embedding model. It catches semantic relevance that keyword overlap misses. Probably worth it for 100+ document systems.

Adaptive alpha. Classify the query type dynamically and adjust alpha rather than using a fixed 0.65. A short query with rare domain terms probably wants more TF-IDF weight; a long natural-language question wants more embedding weight.

Persistent memory. The current Memory class is in-process only. A lightweight SQLite backend with the same add() / get_weighted() interface would survive restarts and enable cross-session continuity.
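As a starting point for the persistent-memory idea, the sketch below stores turns in SQLite using only the standard library. The table schema, the constructor parameters, and the simplified scoring (importance × recency only, with the freshness and relevance terms left out for brevity) are assumptions of this sketch; only the add() / get_weighted() method names come from the article.

```python
import math
import sqlite3
import time


class SQLiteMemory:
    """Hypothetical persistent store exposing add() and get_weighted()."""

    def __init__(self, path=":memory:", decay_rate=0.001, min_score=0.1):
        self.conn = sqlite3.connect(path)  # pass a file path to survive restarts
        self.decay_rate = decay_rate
        self.min_score = min_score
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS turns ("
            "id INTEGER PRIMARY KEY, role TEXT, content TEXT, "
            "importance REAL, created_at REAL)"
        )
        self.conn.commit()

    def add(self, role, content, importance=1.0, now=None):
        created_at = time.time() if now is None else now
        self.conn.execute(
            "INSERT INTO turns (role, content, importance, created_at) "
            "VALUES (?, ?, ?, ?)",
            (role, content, importance, created_at),
        )
        self.conn.commit()

    def get_weighted(self, query="", now=None):
        """Return (score, role, content) for surviving turns, best first.

        query is accepted for interface compatibility but unused here,
        because this sketch omits the relevance boost.
        """
        now = time.time() if now is None else now
        rows = self.conn.execute(
            "SELECT role, content, importance, created_at FROM turns"
        ).fetchall()
        scored = []
        for role, content, importance, created_at in rows:
            score = importance * math.exp(-self.decay_rate * (now - created_at))
            if score >= self.min_score:
                scored.append((score, role, content))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored


memory = SQLiteMemory()
memory.add("user", "Explain how memory decay prevents context bloat.", 2.5, now=0)
memory.add("user", "What is the weather in Chennai today?", 1.1, now=0)
survivors = memory.get_weighted(now=3000)
```

With these illustrative settings, 3,000 seconds after both turns were stored only the high-importance turn remains above the threshold.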

Closing

RAG gets you the right documents. Prompt engineering gets you the right instructions. Context engineering gets you the right context.

Prompt engineering decides how the model thinks. Context engineering decides what it gets to think about.

Most systems optimise the former and ignore the latter. That’s why they break.

The full source code with all seven demos is at:

References

[1] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 33, 9459–9474.

[2] Karpathy, A. (2025). Context Engineering.

[3] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830.

[4] Baddeley, A. (2000). The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11), 417–423.

[5] Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. EMNLP 2004.

[6] OpenAI. (2023). Counting tokens with tiktoken.

[7] Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.

[8] OpenAI. (2023). tiktoken: Fast BPE tokeniser for use with OpenAI’s models.

[9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

Disclosure

All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers are from actual demo runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running demo.py, except where the article explicitly notes numbers are calculated from known inputs. The sentence-transformers library is used as an optional dependency for embedding generation in hybrid retrieval mode. All other functionality runs on the Python standard library and numpy only. I have no financial relationship with any tool, library, or company mentioned in this article.
