As developers build more and more sophisticated agents on Cloudflare, one of the biggest challenges they face is getting the right information into context at the right time. The quality of results produced by models is directly tied to the quality of context they operate with, but even as context window sizes grow past a million (1M) tokens, context rot remains an unsolved problem. A natural tension emerges between two bad options: keep everything in context and watch quality degrade, or aggressively prune and risk losing information the agent needs later.
Today we're announcing the private beta of Agent Memory, a managed service that extracts knowledge from agent conversations and makes it available when it's needed, without filling up the context window.
It gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. In this post, we'll explain how it works, and what it can help you build.
The state of agentic memory
Agentic memory is one of the fastest-moving areas in AI infrastructure, with new open-source libraries, managed services, and research prototypes launching on a near-weekly basis. These offerings vary widely in what they store, how they retrieve, and what kinds of agents they're designed for. Benchmarks like LongMemEval, LoCoMo, and BEAM provide useful apples-to-apples comparisons, but they also make it easy to build systems that overfit to a specific evaluation and break down in production.
Existing offerings also differ in architecture. Some are managed services that handle extraction and retrieval in the background; others are self-hosted frameworks where you run the memory pipeline yourself. Some expose constrained, purpose-built APIs that keep memory logic out of the agent's main context; others give the model raw access to a database or filesystem and let it design its own queries, burning tokens on storage and retrieval strategy instead of the actual task. Some try to fit everything into the context window, partitioning across multiple agents if needed, while others use retrieval to surface only what's relevant.
Agent Memory is a managed service with an opinionated API and a retrieval-based architecture. We have carefully considered the alternatives, and we believe this combination is the right default for most production workloads. Tight ingestion and retrieval pipelines are superior to giving agents raw filesystem access. In addition to improved cost and performance, they provide a better foundation for the complex reasoning tasks required in production, like temporal logic, supersession, and instruction following. We'll likely expose data for programmatic querying down the road, but we expect that to be useful for edge cases, not common ones.
We built Agent Memory because the workloads we see on our platform exposed gaps that existing approaches don't fully address. Agents working for weeks or months against real codebases and production systems need memory that stays useful as it grows, not just memory that performs well on a clean benchmark dataset that might fit entirely into a newer model's context window. They need fast ingestion. They need retrieval that doesn't block the conversation. And they need to run on models that keep the per-query cost reasonable.
Agent Memory stores memories in a profile, which is addressed by name. A profile gives you a small set of operations: ingest a conversation, remember something specific, recall what you need, list memories, or forget a specific memory. Ingest is the bulk path, typically called when the harness compacts context. Remember is for the model to store something important on the spot. Recall runs the full retrieval pipeline and returns a synthesized answer.
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Get a profile -- an isolated memory store shared across sessions, agents, and users
    const profile = await env.MEMORY.getProfile("my-project");

    // Ingest -- extract memories from a conversation (typically called at compaction)
    await profile.ingest([
      { role: "user", content: "Set up the project with React and TypeScript." },
      { role: "assistant", content: "Done. Scaffolded a React + TS project targeting Workers." },
      { role: "user", content: "Use pnpm, not npm. And dark mode by default." },
      { role: "assistant", content: "Got it -- pnpm and dark mode as default." },
    ], { sessionId: "session-001" });

    // Remember -- store a single memory explicitly (direct tool use by the model)
    const memory = await profile.remember({
      content: "API rate limit was increased to 10,000 req/s per zone after the April 10 incident.",
      sessionId: "session-001",
    });

    // Recall -- retrieve memories and get a synthesized answer
    const results = await profile.recall("What package manager does the user prefer?");
    console.log(results.result); // "The user prefers pnpm over npm."

    return Response.json({ ok: true });
  },
};
Agent Memory is accessed via a binding from any Cloudflare Worker. It can also be accessed via a REST API for agents running outside of Workers, following the same pattern as other Cloudflare developer platform APIs. If you're building with the Cloudflare Agents SDK, the Agent Memory service integrates neatly as the reference implementation for handling compaction, remembering, and searching over memories in the memory portion of the Sessions API.
What you can build with it
Agent Memory is designed to work across a wide range of agent architectures:
Memory for individual agents. Whether you're building with coding agents like Claude Code or OpenCode with a human in the loop, using self-hosted agent frameworks like OpenClaw or Hermes to act on your behalf, or wiring up managed services like Anthropic's Managed Agents, Agent Memory can serve as the persistent memory layer without any changes to the agent's core loop.
Memory for custom agent harnesses. Many teams are building their own agent infrastructure, including background agents that run autonomously without a human in the loop. Ramp Examine is one public example; Stripe and Spotify have described similar systems. These harnesses can benefit from giving their agents memory that persists across sessions and survives restarts.
Shared memory across agents, people, and tools. A memory profile doesn't have to belong to a single agent. A team of engineers can share a memory profile so that knowledge learned by one person's coding agent is available to everyone: coding conventions, architectural decisions, tribal knowledge that currently lives in people's heads or gets lost when context is pruned. A code review bot and a coding agent can share memory so that review feedback shapes future code generation. The knowledge your agents accumulate stops being ephemeral and starts becoming a durable team asset.
While search is part of memory, agent search and agent memory solve distinct problems. AI Search is our primitive for finding results across unstructured and structured data; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions. An agent can use both, and they're designed to work together.
As agents become more capable and more deeply embedded in business processes, the memory they accumulate becomes genuinely valuable, not just as operational state, but as institutional knowledge that took real work to build. We're hearing growing concern from customers about what it means to tie that asset to a single vendor, which is reasonable. The more an agent learns, the higher the switching cost if that memory can't move with it.
Agent Memory is a managed service, but your data is yours. Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to.
To understand what happens behind the API shown above, it helps to break down how agents manage context. An agent has three components:
A harness that drives repeated calls to a model, facilitates tool calls, and manages state.
A model that takes context and returns completions.
State that includes both the current context window and additional information outside context: conversation history, files, databases, memory.
The critical moment in an agent's context lifecycle is compaction, when the harness decides to shorten context to stay within a model's limits or to avoid context rot. Today, most agents discard information permanently. Agent Memory preserves information on compaction instead of dropping it.
Agent Memory integrates into this lifecycle in two ways:
Bulk ingestion at compaction. When the harness compacts context, it ships the conversation to Agent Memory for ingestion. Ingestion extracts facts, events, instructions, and tasks from the message history, deduplicates them against existing memories, and stores them for future retrieval.
Direct tool use by the model. The model gets tools to interact directly with memories, including the ability to recall (search memories for specific information). The model can also remember (explicitly store a memory based on something important), forget (mark a memory as no longer relevant or true), and list (see what memories are stored). These are lightweight operations that don't require the model to design queries or manage storage. The primary agent should never burn context on storage strategy. The tool surface it sees is deliberately constrained so that memory stays out of the way of the actual task.
When a conversation arrives for ingestion, it passes through a multi-stage pipeline that extracts, verifies, classifies, and stores memories.
The first step is deterministic ID generation. Each message gets a content-addressed ID: a SHA-256 hash of session ID, role, and content, truncated to 128 bits. If the same conversation is ingested twice, each message resolves to the same ID, making re-ingestion idempotent.
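The ID scheme described above can be sketched in a few lines. The field separator and the exact serialization are assumptions for illustration; only the ingredients (SHA-256 over session ID, role, and content, truncated to 128 bits) come from the design.

```typescript
import { createHash } from "node:crypto";

// Content-addressed message ID: SHA-256 over session ID, role, and content,
// truncated to 128 bits (32 hex characters). The "\x00" separator is an
// assumption, not the service's actual serialization.
function messageId(sessionId: string, role: string, content: string): string {
  const digest = createHash("sha256")
    .update([sessionId, role, content].join("\x00"))
    .digest("hex");
  return digest.slice(0, 32); // 128 bits of the 256-bit digest
}

// Re-ingesting the same message yields the same ID, so duplicate writes
// can be skipped and ingestion stays idempotent.
const a = messageId("session-001", "user", "Use pnpm, not npm.");
const b = messageId("session-001", "user", "Use pnpm, not npm.");
// a === b
```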
Next, the extractor runs two passes in parallel. A full pass chunks messages at roughly 10K characters with two-message overlap and processes up to four chunks concurrently. Each chunk gets a structured transcript with role labels, relative dates resolved to absolutes ("yesterday" becomes "2026-04-14"), and line indices for source provenance. For longer conversations (9+ messages), a detail pass runs alongside the full pass, using overlapping windows that focus specifically on extracting concrete values like names, prices, version numbers, and entity attributes that broad extraction tends to miss. The two result sets are then merged.
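The chunking step of the full pass can be sketched as a greedy character budget with a two-message overlap, so a fact stated across a chunk boundary still appears whole in at least one chunk. `chunkMessages` is a hypothetical name and the exact boundary rules in the production pipeline may differ.

```typescript
type Msg = { role: string; content: string };

// Greedy chunking sketch: fill each chunk up to ~maxChars, then start the
// next chunk two messages back so boundary-spanning facts are not split.
function chunkMessages(messages: Msg[], maxChars = 10_000, overlap = 2): Msg[][] {
  const chunks: Msg[][] = [];
  let start = 0;
  while (start < messages.length) {
    let size = 0;
    let end = start;
    // Always take at least one message, even if it alone exceeds the budget.
    while (end < messages.length && (size === 0 || size + messages[end].content.length <= maxChars)) {
      size += messages[end].content.length;
      end++;
    }
    chunks.push(messages.slice(start, end));
    if (end >= messages.length) break;
    // Step back `overlap` messages, but always make forward progress.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```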
The next step is to verify each extracted memory against the source transcript. The verifier runs eight checks covering entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Each item is passed, corrected, or dropped accordingly.
The pipeline then classifies each verified memory into one of four types.
Facts represent what's true right now: atomic, stable information like "the project uses GraphQL" or "the user prefers dark mode."
Events capture what happened at a specific time, like a deployment or a decision.
Instructions describe how to do something, such as procedures, workflows, and runbooks.
Tasks track what's being worked on right now and are ephemeral by design.
Facts and instructions are keyed. Each gets a normalized topic key, and when a new memory has the same key as an existing one, the old memory is superseded rather than deleted. This creates a version chain with a forward pointer from the old memory to the new one. Tasks are excluded from the vector index entirely to keep it lean, but they remain discoverable via full-text search.
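The supersession mechanics can be sketched with a simple in-memory store. The production service persists this in SQLite inside a Durable Object; the `KeyedStore` class and its field names here are invented for illustration.

```typescript
type MemoryRecord = {
  id: string;
  key: string;           // normalized topic key, e.g. "user.package_manager"
  content: string;
  supersededBy?: string; // forward pointer to the memory that replaced this one
};

// Keyed supersession sketch: a new memory with an existing topic key marks
// the old one as superseded instead of deleting it, forming a version chain.
class KeyedStore {
  private byId = new Map<string, MemoryRecord>();
  private latestByKey = new Map<string, string>();

  upsert(record: MemoryRecord): void {
    const prevId = this.latestByKey.get(record.key);
    if (prevId) {
      // Supersede, don't delete: history stays walkable via the forward pointer.
      this.byId.get(prevId)!.supersededBy = record.id;
    }
    this.byId.set(record.id, record);
    this.latestByKey.set(record.key, record.id);
  }

  current(key: string): MemoryRecord | undefined {
    const id = this.latestByKey.get(key);
    return id ? this.byId.get(id) : undefined;
  }

  get(id: string): MemoryRecord | undefined {
    return this.byId.get(id);
  }
}
```

Retrieval only ever surfaces the current end of each chain, while the superseded versions remain available for temporal questions.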
Finally, everything is written to storage using INSERT OR IGNORE so that content-addressed duplicates are silently skipped. After returning a response to the harness, background vectorization runs asynchronously. The embedding text prepends the 3-5 search queries generated during classification to the memory content itself, bridging the gap between how memories are written (declaratively: "user prefers dark mode") and how they're searched (interrogatively: "what theme does the user want?"). Vectors for superseded memories are deleted in parallel with new upserts.
When an agent searches for a memory, the query goes through a separate retrieval pipeline. During development, we discovered that no single retrieval method works best for all queries, so we run multiple methods in parallel and fuse the results.
The first stage runs query analysis and embedding concurrently. The query analyzer produces ranked topic keys, full-text search terms with synonyms, and a HyDE (Hypothetical Document Embedding): a declarative statement phrased as if it were the answer to the question. This stage also embeds the raw query directly, and both embeddings are used downstream.
In the next stage, five retrieval channels run in parallel. Full-text search with Porter stemming handles keyword precision for queries where you know the exact term but not the surrounding context. Fact-key lookup returns results where the query maps directly to a known topic key. Raw message search queries the stored conversation messages directly via full-text search for unclassified conversation fragments, acting as a safety net that catches verbatim details the extraction pipeline may have generalized away. Direct vector search finds semantically similar memories using the embedded query. And HyDE vector search finds memories that are similar to what the answer would look like, which often surfaces results that direct embedding misses, particularly for abstract or multi-hop queries where the question and the answer use different vocabulary.
In the third and final stage, results from all five retrieval channels are merged using Reciprocal Rank Fusion (RRF), where each result receives a weighted score based on where it ranked within a given channel. Fact-key matches get the highest weight because an exact topic match is the strongest signal. Full-text search, HyDE vectors, and direct vectors are each weighted based on strength of signal. Finally, raw message matches are included with low weight as a safety net to catch candidate results the extraction pipeline may have missed. Ties are broken by recency, with newer results ranked higher.
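Weighted RRF is compact enough to sketch directly. The channel weights and the RRF constant k = 60 below are illustrative, not the production values, and the plain score sort stands in for the recency tie-break.

```typescript
// Weighted Reciprocal Rank Fusion sketch: each result scores
// weight / (k + rank) per channel, and scores sum across channels.
function fuse(channels: { weight: number; results: string[] }[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const { weight, results } of channels) {
    results.forEach((id, i) => {
      // i + 1 is the 1-based rank of this result within its channel.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + i + 1));
    });
  }
  // Production breaks ties by recency; a plain score sort is enough here.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const fused = fuse([
  { weight: 3.0, results: ["fact-1", "fact-2"] },         // fact-key lookup (highest weight)
  { weight: 1.0, results: ["vec-1", "fact-2", "vec-2"] }, // vector search
]);
// "fact-2" ranks first: agreement across two channels outscores a single
// first-place hit in the higher-weighted channel.
```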
The pipeline then passes the top candidates to the synthesis model, which generates a natural-language answer to the original search query. Some specific query types get special treatment. For instance, temporal computation is handled deterministically via regex and arithmetic, not by the LLM. The results are injected into the synthesis prompt as pre-computed facts. Models are unreliable at things like date math, so we don't ask them to do it.
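A minimal version of that deterministic temporal handling looks like the function below: match a relative-date expression with a regex, do the arithmetic in code, and hand the resolved date to the synthesis prompt as a pre-computed fact. The pattern set is deliberately tiny; the real pipeline covers far more expressions.

```typescript
// Resolve "N days/weeks/months ago" deterministically instead of asking the
// model to do date math. Returns an ISO date string, or null if no match.
function resolveRelativeDate(expr: string, now: Date): string | null {
  const m = /(\d+)\s+(day|week|month)s?\s+ago/.exec(expr);
  if (!m) return null;
  const n = Number(m[1]);
  const d = new Date(now);
  if (m[2] === "day") d.setUTCDate(d.getUTCDate() - n);
  else if (m[2] === "week") d.setUTCDate(d.getUTCDate() - 7 * n);
  else d.setUTCMonth(d.getUTCMonth() - n);
  return d.toISOString().slice(0, 10); // YYYY-MM-DD, injected as a pre-computed fact
}

const now = new Date("2026-04-15T00:00:00Z");
// resolveRelativeDate("3 days ago", now) → "2026-04-12"
```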
Our initial prototype of Agent Memory was lightweight, with a basic extraction pipeline, vector storage, and simple retrieval. It worked well enough to prove the concept, but not well enough to ship.
So we put it into an agent-driven loop and iterated. The cycle looked like this: run benchmarks, analyze where we had gaps, propose solutions, have a human review the proposals to select strategies that generalize rather than overfit, let the agent make the changes, repeat.
This worked well, but came with one particular challenge. LLMs are stochastic, even with temperature set to zero. This caused results to vary across runs, which meant we had to average multiple runs (time-consuming for large benchmarks) and rely on trend analysis alongside raw scores to understand what was actually working. Along the way we had to guard carefully against overfitting the benchmarks in ways that didn't genuinely make the product better for the general case.
Over time, this got us to a place where benchmark scores improved consistently with each iteration and we had a generalized architecture that would work in the real world. We intentionally tested against multiple benchmarks (including LoCoMo, LongMemEval, and BEAM) to push the system in different ways.
We build Cloudflare on Cloudflare, and Agent Memory is no different. Existing primitives that are powerful and easily composable allowed us to ship the first prototype in a weekend and a fully functioning, productionized internal version of Agent Memory in less than a month. In addition to speed of delivery, Cloudflare turned out to be the right place to build this kind of service for a few other reasons.
Under the hood, Agent Memory is a Cloudflare Worker that coordinates several systems:
Durable Objects: store the raw messages and classified memories
Vectorize: provides vector search over embedded memories
Workers AI: runs the LLMs and embedding models
Each memory profile maps to its own Durable Object instance and Vectorize index, keeping data fully isolated between profiles. It also lets us scale easily as demand grows.
Compute isolation via Durable Objects. Each memory profile gets its own Durable Object (DO) with a SQLite-backed store, providing strong isolation between tenants without any infrastructure overhead. The DO handles FTS indexing, supersession chains, and transactional writes. DO's getByName() addressing means any request, from anywhere, can reach the right memory profile by name, and ensures that sensitive memories are strongly isolated from other tenants.
Storage across the stack. Memory content lives in SQLite-backed DOs. Vectors live in Vectorize. In the future, snapshots and exports will go to R2 for cost-efficient long-term storage. Each primitive is purpose-built for its workload; we don't need to force everything into a single shape or database.
Local model inference with Workers AI. The entire extraction, classification, and synthesis pipeline runs on Workers AI models deployed on Cloudflare's network. All AI calls pass a session affinity header keyed to the memory profile name, so repeated requests hit the same backend for prompt caching benefits.
One interesting finding from our model selection: a bigger, more powerful model isn't always better. We currently default to Llama 4 Scout (17B, 16-expert MoE) for extraction, verification, classification, and query analysis, and Nemotron 3 (120B MoE, 12B active parameters) for synthesis. Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers. The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency.
We run Agent Memory internally for our own workflows at Cloudflare, as both a proving ground and a source of ideas for what to build next.
Coding agent memory. We use an internal OpenCode plugin that wires Agent Memory into the development loop. Agent Memory provides memory that survives compaction, within sessions and across them. The less obvious benefit has been shared memory across a team: with a shared profile, the agent knows what other members of your team have already learned, which means it can stop asking questions that have already been answered and stop making mistakes that have already been corrected.
Agentic code review. We've connected Agent Memory to our internal agentic code reviewer. Arguably the most useful thing it learned to do was stay quiet. The reviewer now remembers that a particular comment wasn't relevant in a past review, or that a specific pattern was flagged and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter.
Chat bots. We've also wired memory into an internal chat bot that ingests message history and then lurks, remembering new messages as they're sent. Then, when someone asks a question, the bot can answer based on previous conversations.
We also have a number of additional use cases that we plan to roll out internally in the near future as we refine and improve the service.
We're continuing to test and refine Agent Memory internally, improving the extraction pipeline, tuning retrieval quality, and expanding the background processing capabilities. Similar to how the human brain consolidates memories by replaying and strengthening connections during sleep, we see opportunities for memory storage to improve asynchronously, and we're currently implementing and testing various strategies to make this work.
We plan to make Agent Memory publicly available soon. If you're building agents on Cloudflare and want early access, contact us to join the waitlist.
If you want to dig into the architecture, share what you're building, or follow along as we develop this further, join us on the Cloudflare Discord or start a thread in the Cloudflare Community. We're actively watching both, and we're curious about what production agent workloads actually look like in the wild.



