This started because my Obsidian assistant kept getting amnesia. I didn't want to stand up Pinecone or Redis just so Claude could remember that Alice approved the Q3 budget last week. It turns out that, with 200K+ context windows, you may not need any of that.
I want to share a new mechanism I've started running: a system built on SQLite and direct LLM reasoning, with no vector databases and no embedding pipeline. Vector search was largely a workaround for tiny context windows and for keeping prompts from getting messy. With modern context sizes, you can often skip it and just let the model read your memories directly.
The Setup
I take detailed notes, both in my personal life and at work. I used to scrawl in notebooks that would get misplaced or sit on a shelf and never be referenced again. A few years ago, I moved to Obsidian for everything, and it has been fantastic. In the last 12 months, I've started hooking genAI up to my notes. Today I run both Claude Code (for my personal notes) and Kiro-CLI (for my work notes). I can ask questions, get them to do roll-ups for leadership, track my goals, and write my reviews. But it has always had one big Achilles' heel: memory. When I ask about a meeting, it uses an Obsidian MCP to search my vault. It's time-consuming, error-prone, and I want it to be better.
The obvious fix is a vector database. Embed the memories. Store the vectors. Do a similarity search at query time. It works. But it also means a Redis stack, a Pinecone account, or a locally running Chroma instance, plus an embedding API, plus pipeline code to stitch it all together. For a personal tool, that's a lot, and there's a real risk that it won't work exactly the way I want. I need to ask things like "what happened on Feb 1 2026?" or "recap the last meeting I had with this person", queries that embeddings and RAG aren't great at.
Then I ran across Google's always-on-memory agent. The idea is pretty simple: don't do a similarity search at all; just give the LLM your recent memories directly and let it reason over them.
I wanted to know if that held up on AWS Bedrock with Claude Haiku 4.5. So I built it (with Claude Code, of course) and added some extra bells and whistles.
Go check out my GitHub repo, but make sure to come back!
An Insight That Changes the Math
Older models topped out at 4K or 8K tokens. You couldn't fit more than a few documents in a prompt. Embeddings let you retrieve the relevant documents without loading everything. That was genuinely necessary. Haiku 4.5 offers a 200K context window, so what can we do with that?
A structured memory (summary, entities, topics, importance score) runs about 300 tokens. That means we can fit about 650 memories before hitting the ceiling. In practice it's a bit less, since the system prompt and query also consume tokens, but for a personal assistant that tracks meetings, notes, and conversations, that's months of context.
No embeddings, no vector indexes, no cosine similarity.
The LLM reasons directly over semantics, and it's better at that than cosine similarity.
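A quick back-of-envelope check of those numbers (the 5K-token prompt overhead is my assumption; the per-memory size and window come from the post):

```python
# Capacity math: how many ~300-token structured memories fit in a
# 200K-token context window, before and after prompt overhead?
CONTEXT_WINDOW = 200_000   # Haiku's context window, in tokens
TOKENS_PER_MEMORY = 300    # ~300 tokens per formatted memory record
PROMPT_OVERHEAD = 5_000    # system prompt + user query (assumed)

theoretical = CONTEXT_WINDOW // TOKENS_PER_MEMORY
practical = (CONTEXT_WINDOW - PROMPT_OVERHEAD) // TOKENS_PER_MEMORY
print(theoretical, practical)  # 666 650
```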
The Architecture
The orchestrator isn't a separate service. It's a Python class inside the FastAPI process that coordinates the three agents.
The IngestAgent's job is simple: take raw text and ask Haiku what's worth remembering. It extracts a summary, entities (names, places, things), topics, and an importance score from 0 to 1. That bundle goes into the `memories` table.
The ConsolidateAgent runs on an intelligent schedule: at startup if any unconsolidated memories exist, when a threshold is reached (5+ memories by default), and daily as a forced pass. When triggered, it batches unconsolidated memories and asks Haiku to find cross-cutting connections and generate insights. Results land in a `consolidations` table. The system tracks the last consolidation timestamp to ensure regular processing even when memories accumulate slowly.
The QueryAgent reads recent memories plus consolidation insights into a single prompt and returns a synthesized answer with citation IDs. That's the whole query path.
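As a sketch of that query path, assembling one prompt from both layers might look like this (the prompt wording and function name are my assumptions; only the two-layer structure and the citation-ID format come from the project):

```python
# Sketch: build a single prompt from raw memories + consolidation insights,
# so the LLM can reason over both layers at once. Prompt text is assumed.
def build_query_prompt(memories, consolidations, question):
    lines = [
        "You are a memory assistant. Answer using only the records below.",
        "Cite sources as [memory:<id>] or [consolidation:<id>].",
        "",
        "MEMORIES:",
    ]
    for m in memories:
        lines.append(f"[memory:{m['id']}] ({m['timestamp']}) {m['summary']}")
    lines += ["", "INSIGHTS:"]
    for c in consolidations:
        lines.append(f"[consolidation:{c['id']}] {c['insights']}")
    lines += ["", f"QUESTION: {question}"]
    return "\n".join(lines)

prompt = build_query_prompt(
    [{"id": "a3f1c9d2", "timestamp": "2026-03-27",
      "summary": "Alice confirmed Q3 budget approval of $2.4M"}],
    [{"id": "3c765a26",
      "insights": "Budget oversight appears to be a recurring priority"}],
    "What did Alice say about the budget?",
)
print(prompt)
```

The whole "retrieval" step is string concatenation; the model does the rest.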
What Actually Gets Stored
When you ingest text like "Met with Alice today. Q3 budget is approved, $2.4M," the system doesn't just dump that raw string into the database. Instead, the IngestAgent sends it to Haiku and asks, "What's important here?"
The LLM extracts structured metadata:
{
"id": "a3f1c9d2-...",
"summary": "Alice confirmed Q3 budget approval of $2.4M",
"entities": ["Alice", "Q3 budget"],
"topics": ["finance", "meetings"],
"importance": 0.82,
"source": "notes",
"timestamp": "2026-03-27T14:23:15.123456+00:00",
"consolidated": 0
}
The memories table holds these individual records. At ~300 tokens per memory when formatted into a prompt (metadata included), the theoretical ceiling is around 650 memories in Haiku's 200K context window. I deliberately set the default to 50 recent memories, so I'm well short of that ceiling.
When the ConsolidateAgent runs, it doesn't just summarize memories. It reasons over them. It finds patterns, draws connections, and generates insights about what the memories mean together. These insights get stored as separate records in the consolidations table:
{
"id": "3c765a26-...",
"memory_ids": ["a3f1c9d2-...", "b7e4f8a1-...", "c9d2e5b3-..."],
"connections": "All three meetings with Alice mentioned budget concerns...",
"insights": "Budget oversight appears to be a recurring priority...",
"timestamp": "2026-03-27T14:28:00.000000+00:00"
}
When you query, the system loads both the raw memories *and* the consolidation insights into the same prompt. The LLM reasons over both layers at once: recent facts plus synthesized patterns. That's how you get answers like "Alice has raised budget concerns in three separate meetings [memory:a3f1c9d2, memory:b7e4f8a1] and the pattern suggests this is a high priority [consolidation:3c765a26]."
This two-table design is the entire persistence layer. A single SQLite file. No Redis. No Pinecone. No embedding pipeline. Just structured records that an LLM can reason over directly.
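A minimal sketch of that two-table layer (column names mirror the JSON records above; the project's actual schema may differ in details):

```python
import sqlite3

# Two-table persistence layer: individual memories plus consolidation
# insights, in one SQLite file. Shown in-memory here for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memories (
    id           TEXT PRIMARY KEY,
    summary      TEXT NOT NULL,
    entities     TEXT,               -- JSON array, e.g. '["Alice"]'
    topics       TEXT,               -- JSON array
    importance   REAL,               -- 0.0 to 1.0
    source       TEXT,
    timestamp    TEXT,
    consolidated INTEGER DEFAULT 0   -- 0 until a consolidation pass runs
);
CREATE TABLE consolidations (
    id          TEXT PRIMARY KEY,
    memory_ids  TEXT,                -- JSON array of memory ids
    connections TEXT,
    insights    TEXT,
    timestamp   TEXT
);
""")
conn.execute(
    "INSERT INTO memories (id, summary, importance) VALUES (?, ?, ?)",
    ("a3f1c9d2", "Alice confirmed Q3 budget approval of $2.4M", 0.82),
)
pending = conn.execute(
    "SELECT COUNT(*) FROM memories WHERE consolidated = 0"
).fetchone()[0]
print(pending)  # 1
```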
What the Consolidation Agent Actually Does
Most memory systems are purely retrieval. They store, search, and return relevant text. The consolidation agent works differently: it reads a batch of unconsolidated memories and asks, "What connects these?", "What do these have in common?", "How do these relate?"
These insights get written as a separate consolidations record. When you query, you get both the raw memories and the synthesized insights. The agent isn't just recalling. It's reasoning.
The sleeping-brain analogy from the original Google implementation seems fairly accurate. During idle time, the system is processing rather than just waiting. This is something I often wrestle with when building agents: how can I make them more autonomous so they can work when I don't? This is a good use of that "downtime".
For a personal tool, this matters. "You've had three meetings with Alice this month, and all of them mentioned budget concerns" is more useful than three individual recall hits.
The original design used a simple threshold for consolidation: it waited for five memories before consolidating. That works for active use. But if you're only ingesting sporadically, a note here, an image there, you might wait days before hitting the threshold. Meanwhile, those memories sit unprocessed, and queries don't benefit from the consolidation agent's pattern recognition.
So I decided to add two more triggers. When the server starts, it checks for unconsolidated memories from the previous session and processes them immediately. No waiting. And on a daily timer (configurable), it forces a consolidation pass if anything is pending, regardless of whether the 5-memory threshold has been met. So even a single note per week still gets consolidated within 24 hours.
The original threshold-based mode still runs for active use. But now there's a safety net beneath it. If you're actively ingesting, the threshold catches it. If you're not, the daily pass does. And on restart, nothing falls through the cracks.
File Watching and Change Detection
I have an Obsidian vault with hundreds of notes, and I don't want to manually ingest each one. I want to point the watcher at the vault and let it handle the rest. That's exactly what this does.
On startup, the watcher scans the directory and ingests everything it hasn't seen before. It then runs two modes in the background: a quick scan every 60 seconds checks for new files (fast, no hash calculation, just "is this path in the database?"), and a full scan every 30 minutes calculates SHA256 hashes and compares them to stored values. If a file has changed, the system deletes the old memories, cleans up any consolidations that referenced them, re-ingests the new version, and updates the tracking record. No duplicates. No stale data.
For personal note workflows, the watcher covers what you'd expect:
- Text files (.txt, .md, .json, .csv, .log, .yaml, .yml)
- Images (.png, .jpg, .jpeg, .gif, .webp), analyzed via Claude Haiku's vision capabilities
- PDFs (.pdf), text extracted via PyPDF2
Recursive scanning and directory exclusions are configurable. Edit a note in Obsidian, and within 30 minutes, the agent's memory reflects the change.
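The full-scan change check can be sketched in a few lines (function names are illustrative; the real watcher also handles re-ingestion and consolidation cleanup):

```python
import hashlib
import tempfile
from pathlib import Path

# Full-scan change detection: hash each file and compare against the
# stored digest, updating the tracking record as we go.
def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def has_changed(path: Path, stored: dict[str, str]) -> bool:
    digest = sha256_of(path)
    changed = stored.get(str(path)) != digest
    stored[str(path)] = digest        # update the tracking record
    return changed

with tempfile.TemporaryDirectory() as d:
    note = Path(d) / "meeting.md"
    note.write_text("Met with Alice today.")
    tracked: dict[str, str] = {}
    print(has_changed(note, tracked))  # True: first time seen
    print(has_changed(note, tracked))  # False: unchanged
    note.write_text("Met with Alice today. Budget approved.")
    print(has_changed(note, tracked))  # True: content edited
```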
Why No Vector DB
Whether you need embeddings for your personal notes comes down to two things: how many notes you have and how you want to search them.
Vector search is genuinely necessary when you have millions of documents and can't fit the relevant ones in context. It's a retrieval optimization for large-scale problems.
At personal scale, you're working with hundreds of memories, not millions. Going the vector route means running an embedding pipeline, paying for the API calls, managing the index, and implementing similarity search to solve a problem that a 200K context window already solves.
Here's how I think about the tradeoffs, along three axes: complexity, accuracy, and scale.
I couldn't justify setting up and maintaining a vector database, even FAISS, for the few notes that I generate.
On top of that, this approach gives me better accuracy for the way I want to search my notes.
Seeing It In Action
Here's what using it actually looks like. Configuration is handled via a .env file with sensible defaults. You can copy the example directly and start using it (assuming you've already run aws configure on your machine).
cp .env.example .env
Then, start the server with the file watcher active:
./scripts/run-with-watcher.sh
Then curl the /ingest endpoint to test a sample ingestion (shown here against localhost:8000; adjust for your configured port). This is optional, just to demonstrate how it works; you can skip it if you're setting up for real use.
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"text": "Met with Alice today. Q3 budget is approved, $2.4M.", "source": "notes"}'
The response will look something like:
{
"id": "a3f1c9d2-...",
"summary": "Alice confirmed Q3 budget approval of $2.4M.",
"entities": ["Alice", "Q3 budget"],
"topics": ["finance", "meetings"],
"importance": 0.82,
"source": "notes"
}
To query it later, curl the query endpoint with:
query?q=What+did+Alice+say+about+the+budget
Or use the CLI:
python cli.py ingest "Paris is the capital of France." --source wikipedia
python cli.py query "What do you know about France?"
python cli.py consolidate # trigger manually
python cli.py status # see memory count, consolidation state
Making It Useful Beyond curl
curl works, but you're not going to curl your memory system at 2 am when you have an idea, so the project has two integration paths.
Claude Code / Kiro-CLI skill. I added a native skill that auto-activates when relevant. Say "remember that Alice approved the Q3 budget" and it stores it without you needing to invoke anything. Ask "what did Alice say about the budget?" next week, and it checks memory before answering. It handles ingestion, queries, file uploads, and status checks through natural conversation. This is how I interact with the memory system most often, since I tend to live in CC/Kiro most of the time.
CLI. For terminal users or scripting:
python cli.py ingest "Paris is the capital of France." --source wikipedia
python cli.py query "What do you know about France?"
python cli.py consolidate
python cli.py status
python cli.py list --limit 10
The CLI talks to the same SQLite database, so you can mix API, CLI, and skill usage interchangeably. Ingest from a script, query from Claude Code, and check status from the terminal. It all hits the same store.
What's Next
The good news: the system works, and I'm using it today. But here are a few additions it could benefit from.
Importance-weighted query filtering. Right now, the query agent reads the N most recent memories. That means old but important memories can get pushed out by recent noise. I want to filter by importance score before building the context, but I'm not sure yet how aggressive to be. I don't want a high-importance memory from two months ago to disappear just because I ingested a bunch of meeting notes this week.
Metadata filtering. Similarly, since each memory has associated metadata, I could use that metadata to filter out memories that are clearly irrelevant. If I'm asking questions about Alice, I don't need memories that only involve Bob or Charlie. For my use case, this could be based on my note hierarchy, since I keep notes aligned to customers and/or specific projects.
Delete and update endpoints. The store is append-only right now. That's fine until you ingest something wrong and want to fix it. DELETE /memory/{id} is an obvious gap. I just haven't needed it badly enough yet to build it.
MCP integration. Wrapping this as an MCP server would let any Claude-compatible client use it as persistent memory. That's probably the highest-value item on this list, but it's also the most work.
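For the importance-weighted filtering idea above, one possible direction is a blended score instead of a hard recency cutoff. This weighting scheme is purely my speculation, not something in the project:

```python
# Speculative sketch: rank memories by a blend of recency and the stored
# importance score, instead of taking the N most recent outright.
def select_memories(memories, n=50, importance_weight=0.5):
    # memories: newest first; rank 0 is the most recent
    def score(item):
        rank, mem = item
        recency = 1.0 - rank / max(len(memories) - 1, 1)
        return (1 - importance_weight) * recency + importance_weight * mem["importance"]
    ranked = sorted(enumerate(memories), key=score, reverse=True)
    return [mem for _, mem in ranked[:n]]

memories = [
    {"summary": "lunch order", "importance": 0.1},                     # most recent
    {"summary": "standup notes", "importance": 0.2},
    {"summary": "Alice approved $2.4M Q3 budget", "importance": 0.9},  # oldest
]
picked = select_memories(memories, n=2)
print([m["summary"] for m in picked])
# ['lunch order', 'Alice approved $2.4M Q3 budget']
```

With a 50/50 blend, the old high-importance budget memory survives the cut while the middling recent note drops out, which is exactly the behavior the filtering is meant to protect.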
Try It
The project is up on GitHub as part of an ongoing series I started, where I implement research papers, explore interesting ideas, and repurpose useful tools for Bedrock.
It's Python with no exotic dependencies: just boto3, FastAPI, and SQLite.
The default model is `us.anthropic.claude-haiku-4-5-20251001-v1:0` (a Bedrock cross-region inference profile), configurable via .env.
A note on security: the server has no authentication by default; it's designed for local use. If you expose it on a network, add auth first. The SQLite database will contain everything you've ever ingested, so treat it accordingly (chmod 600 memory.db is a good start).
If you're building personal AI tooling and stalling on the memory problem, this pattern is worth a look. Let me know if you decide to try it out, how it works for you, and which project you're using it on.
About
Nicholaus Lawson is a Solutions Architect with a background in software engineering and AI/ML. He has worked across many verticals, including Industrial Automation, Healthcare, Financial Services, and Software companies, from start-ups to large enterprises.
This article and any opinions expressed by Nicholaus are his own and not a reflection of his current, past, or future employers or any of his colleagues or associates.
Feel free to connect with Nicholaus via LinkedIn.



