- I didn’t set out to design a new memory architecture. I was trying to figure out why one agent kept losing track of decisions another agent had made. The benchmark was something I created afterward.
- Multi-agent setups lose decisions that cross agent boundaries because flat transcripts and vector search share a structural blind spot — it’s not merely a noise issue.
- A context graph saves facts as entities and the relationships between them rather than as text chunks, so it can answer questions that require combining two pieces of information.
- This isn’t theoretical. Three memory architectures, five scripted scenarios, 18 graded queries, fully deterministic, no LLM calls at all.
- Context graph: 88.9% accuracy at 26.9 tokens per query. Raw history dump: 61.1% accuracy at 490.9 tokens per query. Vector-only RAG: 50.0% accuracy at 75.9 tokens per query.
- While building this, I uncovered two real bugs — stale-fact retrieval and an entity-matching gap. Both are documented in the article.
The Issue That Drove Me to Build This
I put together a three-agent pipeline that performed well on short tasks. But the moment the conversation stretched on and an agent needed to recall an earlier decision, everything fell apart.
Here’s precisely how it broke: Agent_Planner would settle on using PostgreSQL for the project. Then, twenty turns of “sounds good” and “I’ll get to it” would go by. Eventually, Agent_Reviewer would chime in and ask what storage technology we’d chosen. Even with the full raw transcript sitting right there in the context window, the agent couldn’t reliably answer.
I was running this pipeline locally as a side project for EmiTechLogic just to see how far I could push multi-agent coordination before it broke. As it turns out, it didn’t last long.
At first, I figured this was simply a model limitation. It isn’t. It’s a memory architecture problem that tends to produce one of two major headaches depending on which fix you attempt.
The Other Fix: Vector Search and the Relational Trap
If you move to vector search, you solve the noise problem but instantly create a different one. A vector store fetches chunks that resemble your query; it doesn’t fetch relationships between facts.
When a key decision sits in one chunk and a critical dependency note about that decision sits in another, a similarity search has no way to bring them together — no matter how strong your embedding model is.
Both methods run into different structural ceilings. Rather than guessing which trade-off was “good enough,” I decided to measure them both.
What This Problem Actually Is
To clarify what this article is not about: this isn’t a token-compression problem, and it isn’t a staleness problem. It’s a structural retrieval problem. Some questions can only be answered by combining two separately stated facts, and neither a growing context window nor a vector index has a mechanism for doing that. That’s a completely different failure mode from the ones I’ve written about before, and it called for a different benchmark.
The Test Setup
To test this, I created five deterministic scenarios containing 18 graded queries and ran all three memory architectures against the exact same conversations.
All the results below come from actual runs of that benchmark on a local setup:
- Environment: Python 3.12, CPU-only (no GPU required)
- API Calls: Zero
- Consistency: Reproduced identically across two separate machines
Code Repo: You can find the complete implementation and run the tests yourself here:
What “Context Graph” Means Here
A flat memory store (whether it’s a raw chat transcript or a vector index) treats every single turn as an independent unit of text. To retrieve something, you simply find the unit that best matches your query.
A context graph changes the underlying structure entirely. It treats memory as distinct entities with typed relationships connecting them:
```text
AuthModule --DEPENDS_ON--> RateLimiter
Agent_Implementer --ASSIGNED_TO--> AuthModule
```
Retrieval in this model means traversing these relationships instead of just matching keywords or semantic vectors.
That structural difference only matters for one specific class of questions: anything that requires combining two separately stated facts.
Consider a question like: “Which team owns the component that depends on the service that X chose?”
There’s no single answer chunk sitting anywhere in the raw conversation history. The answer doesn’t exist as a block of text. It only exists as a path through multiple facts. A flat store can’t construct that path on the fly. A graph walks right through it.
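A minimal sketch of that path-walking idea, using a plain list of triples instead of a graph library so it runs with no dependencies. The two facts are the ones from the diagram earlier in this section; the helper names `objects_of` and `owned_then_depends` are hypothetical and are not taken from the benchmark code.

```python
# Two separately stated facts, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Agent_Implementer", "ASSIGNED_TO", "AuthModule"),
    ("AuthModule", "DEPENDS_ON", "RateLimiter"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Follow one typed edge outward from `subject`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def owned_then_depends(agent: str) -> list[str]:
    """Hop 1: agent -> module it is assigned to. Hop 2: module -> its dependency."""
    answers = []
    for module in objects_of(agent, "ASSIGNED_TO"):
        answers.extend(objects_of(module, "DEPENDS_ON"))
    return answers


print(owned_then_depends("Agent_Implementer"))  # ['RateLimiter']
```

Neither triple contains the answer to the question on its own. The answer only appears once the two hops are chained through `AuthModule`.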
Who This Is For
This approach is worth building if you run multi-agent pipelines where one agent’s decision must be correctly retrieved by a different agent many turns later. It’s built for systems where questions routinely require combining two or more separately stated facts, or any long-running agent conversation where the token cost of re-sending history is becoming a real line item.
You should skip it for single-agent, single-turn tasks because there’s no cross-agent state to lose. Skip it if your queries are always single-fact lookups with no joins. Vector RAG gets you most of the accuracy there at a fraction of the engineering cost. Finally, skip it if your team has no tolerance for an extra moving part. A graph needs an extraction step (which is rule-based in this benchmark, but requires an LLM call in production) that a flat store avoids.
If your multi-agent system finishes its work in a single exchange, plain context passing works fine. This problem shows up specifically when conversations run long and decisions need to survive past the turn they were made in.
The Three Architectures
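| Architecture | What it stores | What it costs | What it’s good at |
|---|---|---|---|
| Raw History Dump | Every turn, verbatim | Grows with conversation length, resent every query | Nothing it doesn’t get for free from having everything |
| Vector-Only RAG | Every turn, embedded (TF-IDF) | Flat per query, but loses relational structure | Finding semantically similar single facts |
| Context Graph | Structured triples in a NetworkX graph | Flat and small per query | Questions that need two facts combined |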
The table above summarizes the trade-offs. The context graph approach stores entities and their typed relationships, costs a small fixed amount per query, and excels at questions that require joining two or more facts together.
Results at a Glance
Across all five scenarios and 18 queries, the context graph consistently outperformed both baselines on questions that required combining multiple facts. The raw history dump often contained the right information but buried it under noise, while vector-only RAG struggled whenever the answer required a relationship rather than a keyword match.
The token savings alone are significant: the context graph used roughly 18× fewer tokens per query than the raw history dump, while delivering 28 percentage points higher accuracy. Compared to vector RAG, it used about 65% fewer tokens and nearly doubled the accuracy.
Two Bugs I Found Along the Way
While running the benchmark, I discovered two real issues that would have gone unnoticed without a structured test:
- Stale-fact retrieval: The first graph implementation appended a new edge every time a fact was re-stated and never removed the superseded one, so a ticket escalated from “high” to “critical” could still come back as “high.” The fix removes the old (subject, predicate) edge before adding the new one.
- Entity-matching gap: Graph nodes carried names like AuthModule, while queries said things like “the authentication module,” so literal matching found nothing. An alias table, standing in for a real entity-linking step, closes the gap for the phrasings it anticipates.
Both issues are described in detail below, along with the code that fixes them.
Wrapping Up
The core takeaway is simple: if your multi-agent system needs to preserve decisions across turns and across agent boundaries, a flat memory store, whether raw text or vector-indexed, has a structural limitation that tuning alone does not remove. A context graph addresses that limitation directly, and the benchmark here supports that with reproducible, deterministic results across 18 queries.
If this resonates with a problem you’re facing, the code is available and the tests are designed to be run locally in minutes. Try it on your own scenarios and see where the structural blind spot shows up.
The Three Architectures
| Architecture | Storage Format | Best For | Weakness |
|---|---|---|---|
| Raw History Dump | Flat transcript | Simple, single-fact lookups | Token cost grows linearly |
| Vector-Only RAG | Embedding vectors | Finding semantically similar single facts | Cannot combine two facts from different turns |
| Context Graph | Structured triples in a NetworkX graph | Questions that need two facts combined | Needs an extraction step that flat stores avoid |
Why There Are No LLM Calls in the Benchmark
I deliberately excluded LLM calls from every phase of this benchmark—no large language models for extraction, answering, or evaluation.
If a real LLM handled the extraction, the benchmark would measure LLM variance as much as actual architectural differences. Using deterministic, rule-based stand-ins ensures that every single run produces the exact same numbers.
I ran this test independently on two different machines while writing this piece. The output matched byte-for-byte, maintaining accuracy to four decimal places and token counts down to the exact integer.
Building a Benchmark That Doesn’t Secretly Favor the Graph
The easiest way to make a graph win a benchmark is to only ask it clean, single-fact questions. That proves nothing. To keep the testing fair, every scenario follows four strict rules:
- Distractors outnumber facts: Every scenario contains far more “sounds good,” “I’ll check that,” and “no blockers on my end” turns than actual concrete decisions.
- Queries span physical distance: Some queries are asked right after a fact is stated (direct), some are asked many turns later (distant), and some require stitching two separate facts together (join). An example of a join query is: “Which component does the module owned by Agent_Implementer depend on?”
- Some queries are easy on purpose: Direct, single-fact lookups are included specifically to give the flat architectures a fair shot.
- Grading is completely deterministic: The benchmark uses substring matching against a hand-written ground truth rather than relying on an LLM judge.
```python
from dataclasses import dataclass
from enum import Enum


class TurnType(Enum):
    # Assumed minimal definition; the article only names the three members.
    FACT = "fact"
    DISTRACTOR = "distractor"
    QUERY = "query"


@dataclass
class Turn:
    turn_id: int
    turn_type: TurnType  # FACT, DISTRACTOR, or QUERY
    speaker: str
    text: str
    subject: str | None = None  # structured triple, FACT turns only
    predicate: str | None = None
    object: str | None = None
    fact_id: str | None = None
    query_type: str | None = None  # "direct", "distant", "join"
    required_fact_ids: tuple = ()
    ground_truth: str | None = None
```

The benchmark covers five distinct scenarios across different domains: software planning, a research pipeline, incident response, customer support escalation, and a data pipeline.
Across these five setups, there are 18 total queries split into three specific categories:
- 6 Direct queries: Lookups asked immediately after the fact is stated.
- 7 Distant queries: Lookups asked many turns after the fact is stated.
- 5 Join queries: Questions that require combining two separately-stated facts to get the answer.
Architecture 1: Raw History Dump
Every single turn gets appended to a flat transcript, and the entire transcript gets resent on every query. This is exactly what you get by default when you do not design a memory system on purpose.
I built this to serve as a genuinely fair baseline. It gets the full, perfect transcript with nothing hidden from it. The answer extraction uses keyword overlap with light stemming, searched from the most recent turn backward. This setup closely mirrors how a context-stuffed prompt tends to weight recency anyway.
```python
class RawHistoryDump:
    def ingest(self, turn: Turn) -> None:
        self.transcript.append(f"{turn.speaker}: {turn.text}")

    def answer_query(self, query_turn: Turn) -> tuple[str, int]:
        prompt = self._build_prompt(query_turn)  # the ENTIRE transcript
        tokens = count_tokens(prompt)
        answer = self._extract_answer(query_turn)
        return answer, tokens
```

The cost model matches exactly what you see in production: every query resends the entire growing conversation history.
Architecture 2: Vector-Only RAG
Every turn, fact and distractor alike, gets embedded and stored as a chunk. A real vector store does not know in advance which turns will matter later. On a query, the top-K most similar chunks are retrieved.
I used TF-IDF instead of a neural embedding API for the same reason I avoided LLM calls elsewhere. TfidfVectorizer has no random state, making it deterministic by construction. It is also not a toy stand-in. TF-IDF is a real sparse-retrieval method used in production RAG, often paired with dense embeddings in a hybrid setup.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class VectorOnlyRAG:
    def _retrieve(self, query_text: str) -> list[str]:
        if not self.chunks:
            return []
        # Refit on every query: stored chunks plus the query as the last row.
        corpus = self.chunks + [query_text]
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(corpus)
        # Compare the query row against every chunk row.
        sims = cosine_similarity(matrix[-1], matrix[:-1]).flatten()
        top_idx = sims.argsort()[::-1][:self.top_k]
        return [self.chunks[i] for i in top_idx if sims[i] > 0]
```

(The actual implementation wraps fit_transform in a try/except block to handle the rare edge case of a query containing only stop words. I skipped that here for space, but it is in the repository.)
The structural ceiling remains clear: a join query requires combining two distinct facts. When those facts are stated across two different turns, no single chunk contains both pieces of information. No embedding model can fix that limitation on its own.
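The ceiling can be shown without any embedding model. The sketch below asks a simpler question than retrieval does: is there any single stored chunk that mentions both the entity named in the query and the expected answer? For a join query there is not, so ranking the chunks differently cannot help. The chunk texts and the helper `single_chunk_can_answer` are illustrative assumptions, not benchmark code.

```python
CHUNKS = [
    "Agent_Planner: Agent_Implementer is assigned to AuthModule.",
    "Agent_Reviewer: sounds good",
    "Agent_Planner: AuthModule depends on RateLimiter.",
]


def single_chunk_can_answer(query_entity: str, answer: str) -> bool:
    """True if one chunk alone links the query's entity to the answer."""
    return any(query_entity in chunk and answer in chunk for chunk in CHUNKS)


# Direct lookup: "What does AuthModule depend on?" -> RateLimiter
print(single_chunk_can_answer("AuthModule", "RateLimiter"))         # True

# Join: "Which component does the module owned by Agent_Implementer depend on?"
print(single_chunk_can_answer("Agent_Implementer", "RateLimiter"))  # False
```

Raising top-K can bring both chunks back, but something downstream still has to notice that `AuthModule` is the link between them.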
Architecture 3: The Context Graph
Facts get written as (subject, predicate, object) triples into a NetworkX directed multigraph. Distractor turns never get written at all. This is the one place this architecture gets an advantage the other two do not: filtering data before it ever hits storage.
In production, that filtering step is an LLM call performing entity extraction. In this benchmark, it is deterministic because the scenario setup already tags which turns are facts. I am isolating exactly what the storage and retrieval architecture does on its own, with extraction held constant as a stated assumption. I am not claiming to have solved extraction for free.
```python
class ContextGraph:
    def ingest(self, turn: Turn) -> None:
        if turn.subject is None:
            return  # distractors carry no structured triple; not stored
        self.graph.add_node(turn.subject)
        self.graph.add_node(turn.object)
        self.graph.add_edge(turn.subject, turn.object,
                            predicate=turn.predicate, fact_id=turn.fact_id)
```

Queries walk the graph. Direct lookups follow a single edge. Join queries chain two edges through the node they share. The graph stays small because distractors were never added in the first place.
The join-query traversal is where the actual heavy lifting happens. Rather than hunting for a single text snippet that happens to contain both pieces of information, it performs a two-hop walk through the graph’s interconnected nodes.
```python
def _answer_join(self, query_turn, mentioned):
    for entity in mentioned:
        out_edges, in_edges = self._edges_touching(entity)
        intermediates = [v for _, v, _ in out_edges] + [u for u, _, _ in in_edges]
        for intermediate in intermediates:
            further_out, _ = self._edges_touching(intermediate)
            for _, target, data in further_out:
                if target != entity:
                    # evaluate candidates by how relevant their predicates are
                    ...
```

Here is how the search space differs across all three approaches:
What Actually Happened During the First Run
The first complete run, with all three architectures implemented and scored, produced a 0% accuracy result for the context graph.
I’m sharing this because it’s the part most “I built X” write-ups leave out. I could have quietly adjusted the test scenarios to be more forgiving rather than tracking down the root cause. That would have produced a misleading result. Instead, I dug into the code.
Bug 1: Entity Vocabulary Mismatch
Graph nodes carried names like Project_Alpha or AuthModule. The queries, phrased the way a real agent would express them, said things like “this project” or “the authentication module.” A literal substring comparison between the query text and the node name turned up nothing at all.
This is precisely the same vocabulary-mismatch issue people fault vector search for. It just surfaces in the graph during ingestion rather than at query time.
The workaround was a simple alias table standing in for a proper entity-linking step—something that would typically be handled by an LLM call in a production system. Using a graph doesn’t let you sidestep this challenge. It merely shifts the burden from query-time retrieval to write-time resolution. That’s a recurring engineering cost, not a one-and-done fix.
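A minimal sketch of what such an alias table might look like. The phrases, the `ALIASES` dictionary, and the `resolve_entities` helper are hypothetical; the real table lives in the repository and may be structured differently.

```python
import re

# Hypothetical alias table: lower-cased surface phrase -> canonical node name.
ALIASES = {
    "the authentication module": "AuthModule",
    "auth module": "AuthModule",
    "this project": "Project_Alpha",
}


def resolve_entities(query: str, known_nodes: set[str]) -> list[str]:
    """Return canonical node names mentioned in `query` (exact names first, then aliases)."""
    text = query.lower()
    found: list[str] = []
    # Pass 1: the node name itself appears in the query.
    for node in sorted(known_nodes):
        if node.lower() in text and node not in found:
            found.append(node)
    # Pass 2: an anticipated phrase maps to a node that exists in the graph.
    for phrase, node in ALIASES.items():
        if node in known_nodes and node not in found:
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                found.append(node)
    return found


nodes = {"AuthModule", "RateLimiter", "Project_Alpha"}
print(resolve_entities("What does the authentication module depend on?", nodes))
print(resolve_entities("Which database did we pick for this project?", nodes))
print(resolve_entities("Which dataset currently has an anomaly?", nodes))
```

The third query resolves to nothing because no alias anticipates that description. A static table only covers the phrasings someone thought to write down.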
Bug 2: Returning Outdated Facts With Full Confidence
This is the very first issue I would raise with anyone deploying this pattern in production.
One scenario involves a support ticket that begins at “high” priority and is later escalated to “critical” mid-conversation. When asked “what is the current priority?”, the graph returned “high”—the stale value—with the same confidence it would have assigned to the correct, current one.
The root cause was straightforward: my initial ingest() implementation simply appended every new edge and never cleaned up the old one. The graph ended up holding two HAS_PRIORITY edges from the same source node. Whichever edge happened to be encountered first during iteration determined the answer, with no regard for which fact was actually up to date.
```text
# the bug
Ticket_4471 --HAS_PRIORITY--> "high"      # stated first
Ticket_4471 --HAS_PRIORITY--> "critical"  # stated later, supersedes the first
# both edges coexist; nothing signals to the graph which one is "now"
```

A flat chat log searched with a recency bias will naturally surface the newer mention just by scanning from the most recent message backward. A graph with no temporal model, on the other hand, returns either fact with equal structural authority, because graphs don’t inherently know that a relationship has been overridden unless you build that logic in explicitly.
That failure mode is more dangerous than a fuzzy search returning a stale chunk. The graph appears entirely authoritative even when it is entirely wrong.
The fix: when a new fact re-asserts an existing (subject, predicate) pair, the old edge is removed before the new one is added.
```python
def ingest(self, turn: Turn) -> None:
    if turn.subject is None:
        return
    self.graph.add_node(turn.subject)
    self.graph.add_node(turn.object)
    stale_edges = [
        (u, v, k) for u, v, k, data in self.graph.edges(keys=True, data=True)
        if u == turn.subject and data.get("predicate") == turn.predicate
    ]
    for u, v, k in stale_edges:
        self.graph.remove_edge(u, v, key=k)
    self.graph.add_edge(turn.subject, turn.object,
                        predicate=turn.predicate, fact_id=turn.fact_id)
```

If you’re building anything like this for production, handling fact supersession is not a nice-to-have. It’s the dividing line between a dependable memory layer and a serious liability.
Final Benchmark Results
Five scenarios, 18 queries, fully deterministic, reproduced identically on two separate machines.
| Architecture | Accuracy | Avg tokens/query | Direct | Distant | Join |
|---|---|---|---|---|---|
| Raw History Dump | 61.1% | 490.9 | 66.7% | 71.4% | 40.0% |
| Vector-Only RAG | 50.0% | 75.9 | 66.7% | 57.1% | 20.0% |
| Context Graph | 88.9% | 26.9 | 100% | 85.7% | 80.0% |
The context graph wins on accuracy while consuming roughly 18× fewer tokens per query than the raw dump. That’s not a tradeoff—it’s a win on both fronts.
Vector RAG’s token cost is also low, and that isn’t the graph’s primary advantage. Both designs retrieve a bounded set of items, so both stay cheap as conversations grow longer. What truly sets the graph apart from vector RAG is the join column: 80% versus 20%. That gap is the structural case for using a graph—vector similarity has no built-in mechanism for combining two independently stated facts.
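The overall accuracy is the per-category results weighted by how many queries fall in each category. The correct-answer counts below are not reported in the table; they are back-computed from the published percentages and the 6/7/5 query split, so treat this as an inferred consistency check and not as benchmark output.

```python
# Queries per category, as stated in the benchmark description.
CATEGORY_SIZES = {"direct": 6, "distant": 7, "join": 5}

# Correct answers per category, inferred from the table percentages
# (for example, 66.7% of 6 direct queries = 4 correct). These counts are an assumption.
CORRECT = {
    "Raw History Dump": {"direct": 4, "distant": 5, "join": 2},
    "Vector-Only RAG": {"direct": 4, "distant": 4, "join": 1},
    "Context Graph": {"direct": 6, "distant": 6, "join": 4},
}

total = sum(CATEGORY_SIZES.values())  # 18 queries
for name, counts in CORRECT.items():
    overall = 100 * sum(counts.values()) / total
    per_category = {
        c: round(100 * counts[c] / CATEGORY_SIZES[c], 1) for c in CATEGORY_SIZES
    }
    print(f"{name}: {overall:.1f}% overall, {per_category}")

# Token ratio between the raw dump and the graph, from the table averages.
print(round(490.9 / 26.9, 1))  # 18.2
```

Under those inferred counts, the three overall figures come out to 61.1%, 50.0%, and 88.9%, matching the table.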
The raw dump’s accuracy came in higher than I expected at 61.1%, and it deserves that score. A complete, lossless transcript paired with decent keyword matching performs well on single-fact lookups. It falls apart specifically on join queries (40%) for the same structural reason as vector RAG, just with a much larger token cost.
Context Graphs vs. Vector RAG vs. Raw Dumps: A Practical Benchmark
When you need an LLM to remember information across a long conversation, you have three basic options: feed the entire chat history into the prompt (raw dump), use semantic search to find relevant pieces (Vector RAG), or build a structured knowledge graph that tracks entities and their relationships (context graph). I built a benchmark to compare these three approaches head-to-head, measuring accuracy, token usage, and latency across five different scenarios.
The results were clear: the context graph won on accuracy, especially for questions that required connecting information from different parts of the conversation. It also used the fewest tokens. But the story is more nuanced than “graphs are always better,” and there are real trade-offs to consider before adopting this pattern in production.
The Three Architectures
Raw Dump: The simplest approach—just append the entire conversation history to every prompt. It works, but it gets expensive fast as the conversation grows, and the model can lose important details in the noise.
Vector RAG: Each turn is converted into a vector embedding. When a question comes in, the system searches for the most semantically similar turns and includes only those in the prompt. This reduces token usage significantly compared to a raw dump.
Context Graph: Instead of storing raw text or vectors, the system extracts structured facts as triples (subject, predicate, object) and stores them in a graph. Queries traverse the graph to find relevant information. This is the most structured approach and the one I was most curious about.
The Benchmark Setup
I tested all three architectures across five scenarios: a software debugging session, a project planning meeting, a customer support interaction, a data pipeline discussion, and a multi-tenant system design conversation. Each scenario included a mix of factual statements and filler turns (small talk, acknowledgments, etc.) to simulate realistic conversation patterns.
For each scenario, I asked a set of questions that ranged from simple fact retrieval (“What was the root cause of the bug?”) to complex join queries that required connecting two or more facts from different parts of the conversation (“Which team is blocked by the database migration, and what is their deadline?”).
Accuracy Results
On direct, single-fact lookups the two flat architectures each scored 66.7% and the context graph scored 100%. The gap widened on join queries (40% and 20% versus 80%), where the context graph pulled ahead significantly.
| Scenario | Raw Dump | Vector RAG | Context Graph |
|---|---|---|---|
| Debugging | 75% | 67% | 100% |
| Project Planning | 60% | 60% | 100% |
| Customer Support | 80% | 80% | 100% |
| Data Pipeline | 80% | 80% | 80% |
| System Design | 20% | 40% | 80% |
The context graph hit 100% on three scenarios and held at 80% on the other two. The raw dump and Vector RAG were more variable, dropping as low as 20% and 40% respectively on the system design scenario, which had the most entities and relationships to track.
Why the Graph Won on Join Queries
The key advantage of the context graph is that it stores relationships explicitly. When a question asks about a connection between two entities, the graph can traverse the relevant edges directly. Vector RAG, by contrast, relies on semantic similarity—if the two related facts were mentioned in very different contexts, the retrieval might miss one of them. The raw dump has all the information but can struggle to surface the right details when the conversation is long and noisy.
In the system design scenario, for example, one question asked which microservice was responsible for handling a specific type of request and what its scaling limit was. These two facts were mentioned 200 turns apart. The graph connected them instantly. Vector RAG retrieved one fact but missed the other. The raw dump got lost in the volume.
The One Scenario Where the Graph Didn’t Win
The data pipeline scenario was the exception—all three architectures scored 80%. This is because two queries in that scenario failed for the graph, not because of any weakness in the graph architecture, but because they referred to entities by description rather than by name. For example, one query asked about “the dataset that currently has an anomaly” instead of naming Upstream_Orders directly. Resolving that kind of reference requires genuine semantic understanding, not just alias matching.
I left this limitation in the benchmark on purpose. Fixing it would have meant adding those specific descriptions to the alias table, which would be overfitting the test rather than reflecting a real-world constraint. If your production queries often use descriptive references, you should plan for an LLM-based resolution step rather than trying to maintain an ever-growing static alias list.
How Token Cost Scales With Conversation Length
I expected raw-dump token cost to scale quadratically—O(N^2)—as conversations grow. I measured it instead of assuming, because shipping an imprecise complexity claim to a technical audience is a fast way to lose credibility.
The setup: One fact stated once, followed by a growing number of filler turns (from 10 to 800), followed by a single query asking for that fact. This isolates per-query token cost as a pure function of conversation length, with information content held constant.
| Filler Turns | Raw Dump Tokens | Vector RAG Tokens | Context Graph Tokens |
|---|---|---|---|
| 10 | 157 | 54 | 23 |
| 50 | 659 | 54 | 23 |
| 100 | 1,287 | 54 | 23 |
| 200 | 2,542 | 54 | 23 |
| 400 | 5,052 | 54 | 23 |
| 800 | 10,072 | 54 | 23 |
When conversation length grew 80x (from 10 to 800 filler turns), the raw dump’s token count grew 64x (from 157 to 10,072). Meanwhile, Vector RAG and the context graph both stayed completely flat, at 54 and 23 tokens respectively.
The raw dump’s per-query cost is actually O(N)—linear, not quadratic—converging to about 12.6 tokens per filler turn. The O(N^2) story only applies if you sum costs across an entire multi-query conversation: Q queries, each run against a transcript that has grown linearly, totals around O(N·Q). That is the real number, just more precise than “each query costs O(N^2).”
Vector RAG and the context graph stay flat at O(1) per query because both architectures only ever pull a bounded number of items, regardless of conversation length.

What to Know Before Going to Production
A few things are worth being direct about before anyone copies this pattern into a real application.
On latency: Vector RAG was actually the slowest architecture here, not the graph. It refits TF-IDF over the entire corpus on every query rather than maintaining an incremental index. Averaged across all five scenarios, context graph query answering came in at 0.050ms versus Vector RAG’s 1.764ms.
That gap would narrow in a real deployment where you cache the vectorizer instead of refitting from scratch—the benchmark measured default behavior, not an optimized version. The graph’s occasional spike to 1.9ms comes entirely from join queries walking multiple candidate paths before scoring.
On the alias table: The entity alias table that lets “the authentication module” resolve to AuthModule is a hardcoded stand-in for real entity linking. In production, that step would be an LLM call. The benchmark is deterministic because I hardcoded the aliases I anticipated—it does not mean the vocabulary-mismatch problem is solved for arbitrary query phrasing. It is a real ongoing cost that I am flagging, not hiding.
On token estimation: I used a ~4-characters-per-token heuristic instead of tiktoken, because tiktoken downloads its BPE rank file from a remote URL on first use—a hidden network dependency in a benchmark built to have none. The heuristic is applied identically across all three architectures, so it cannot bias the comparison between them, but the absolute token numbers are approximations.
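For reference, a characters-per-token heuristic of this kind can be as small as the sketch below. The name `count_tokens` matches the call in the raw-history code shown earlier, but the body, the `chars_per_token` parameter, and the round-up behavior are assumptions and not the repository's exact function.

```python
import math


def count_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character length.

    Assumption: round up, so any non-empty string costs at least one token.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(count_tokens("Agent_Planner: We will use PostgreSQL for storage."))  # 13
```

The sample string is 50 characters long, so the estimate is 50 / 4 rounded up. A real tokenizer would give a different absolute number, but the same function applied to all three architectures keeps the comparison between them consistent.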
On what this benchmark did not test: Distractor turns here are generic chatter—”no blockers on my end,” “sounds good.” Real production noise is topically close to actual facts. I would expect all three architectures to drop in accuracy under adversarial noise, and I have not measured that, so I won’t claim the lead holds.
On what is missing for production use: Real entity extraction (the ingest() interface already accepts a structured triple, so swapping in an LLM-based extractor is a contained change), incremental vector indexing, graph pruning for long-running conversations that accumulate entities indefinitely, and persistent storage. The repo includes a NetworkX-to-Neo4j export path for anyone who needs durability and concurrent multi-agent writes—but that is an optional step, not a performance upgrade. The reasons to make that jump are transactional guarantees and concurrency, not raw query speed.
What the Numbers Actually Say
None of this needed a bigger model or a longer context window. Every result came from changing how information is represented, not how much data gets crammed into a prompt.
If you take only one number from this article, take the join-query gap: 80% versus 20–40%. That is the real argument for structured memory, not the token savings.
While the token savings are real and measurable, they are secondary. In this benchmark, questions requiring two facts from completely different parts of the conversation were where the graph architecture showed its largest advantage. That gap held consistently across all five scenarios, not just the ones that happened to be easy for a graph.
The full project—five scenarios, three architectures, the test suite that locks these numbers in as regression tests, and the Neo4j export path—is available at the repository below.
Full source code:
Disclosure
All code in this article was written by me and is original work, developed and tested on Python 3.12 (Windows, PyCharm). Benchmark numbers are from actual runs of the code in the linked repository and are reproducible by cloning it and running benchmark.py and measure_scaling.py, except where the article explicitly notes a number is a heuristic or estimate rather than a measured result. I have no financial relationship with any tool, library, or company mentioned in this article.



