Constant-Cost Persistent Semantic State Memory Engine for LLM Agents
Last Updated on May 29, 2026 by Editorial Team
Author(s): Michael Neuberger
Originally published on Towards AI.

If you’ve shipped anything with an LLM in the loop, you know the shape of the bill. Turn one is cheap. Turn fifty is not. Turn five hundred is something you start architecting around. The conversation history grows, the prompt grows, every call ships the entire past back into the model, and the cost curve is — almost insultingly — linear in the thing you actually want more of: useful interaction.
The standard answers are familiar. Truncate. Summarize. Stuff the recent N turns and pretend turn 1 didn’t matter. Build a vector store on the side and hope retrieval picks the right chunk. Each of these works, sort of, until it doesn’t.
Semvec is a different bet: replace the unbounded conversation history with a fixed-size semantic state plus a tiered, content-aware memory. Turn 10 and turn 10,000 carry the same input footprint. The agent still has structured access to decisions, invariants, error patterns, and prior context across sessions — but it pays for that access with a constant, not a growing line item.
This piece is a tour of what that means in practice, what it costs you (because nothing is free), and where it does and doesn’t fit.
The reframe: a fixed-size state, not a growing transcript
The mental model worth installing first: stop thinking of “memory” as “a longer prompt.” Think of it as a small vector that summarizes where the conversation is right now, plus a memory tier system that holds the relevant past out of band, plus a retrieval step that injects only what the next turn needs.
In Semvec terms:
- The semantic state is a single fixed-dimension vector. Every user turn updates it. The update is content-aware — the state absorbs more of a turn that introduces a new direction, and less of a turn that just re-confirms the current one.
- The memory is a three-tier structure (short / medium / long term, default capacities 15 / 50 / 200) that holds embedded turns and consolidated clusters. Promotion between tiers is driven by access patterns and importance, not just recency.
- Retrieval scores candidates by cosine(query, memory) × tier_weight, optionally boosted by domain anchors and resonance triggers (more on those below). Only the top-K results ride into the next prompt.
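The update and scoring rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Semvec's implementation: the β range, the linear mapping from similarity to β, and the tier weights are hypothetical values picked for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def update_state(state, turn_vec, beta_min=0.5, beta_max=0.95):
    """Content-aware EMA. beta is how much of the old state is kept:
    a turn aligned with the state gets a high beta (little change),
    a novel turn gets a low beta (the state absorbs more of it)."""
    similarity = max(0.0, cosine(state, turn_vec))
    beta = beta_min + (beta_max - beta_min) * similarity  # assumed mapping
    return [beta * s + (1.0 - beta) * t for s, t in zip(state, turn_vec)]

# Hypothetical weights: fresher tiers count for more at equal similarity.
TIER_WEIGHT = {"short": 1.0, "medium": 0.8, "long": 0.6}

def retrieve(query_vec, memories, top_k=3):
    """Rank memories by cosine(query, memory) * tier_weight and keep the top_k."""
    scored = [
        (cosine(query_vec, m["vec"]) * TIER_WEIGHT[m["tier"]], m["text"])
        for m in memories
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

The state returned by update_state has the same length as the state passed in, however many turns have been folded into it. That is the property the constant input footprint rests on.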

The user-visible payoff lands in the published baselines. The current headline benchmark is LOCOMO (10 conversations, 1986 QAs), run in the mem0-1:1 evaluation setup: gpt-4o-mini as both reader and judge, temperature 0, with the judge prompt byte-identical to mem0’s harness. On the LLM-as-Judge metric over the 1540 non-adversarial QAs, Semvec scores 0.605 against mem0’s 0.669, roughly 6pp behind mem0 on raw judge accuracy and ahead of LangMem, Zep, A-Mem, MemoryBank and Letta. The trade-off is structural and it runs the other way. Semvec’s ingest is a purely mathematical EMA over the embedding, with zero LLM calls per turn; mem0 runs LLM-driven fact extraction at ingest, at least one call per stored turn (and on workloads where its pipeline fans out, many more). That shows up in wall-clock: Semvec ingests the 675-turn conv-44 in ~3 minutes against mem0’s ~24.5 minutes, about 8× faster, and on the full 1986-QA suite finishes in ~95 minutes where mem0 takes 10–12 hours.
The token side tells the same constant-cost story. Semvec’s reader call carries ~2,000 input tokens against full-context replay’s ~16,000–20,700, roughly 8× fewer per turn (against gpt-4-turbo’s ~26k full context specifically, nearer 93% fewer). Per turn against mem0’s own reader context the two are roughly comparable; the cost gap lives at ingest, which is exactly where per-turn comparisons can’t see it. The honest one-liner is mem0-near quality at zero generative-LLM cost at ingest: you trade ~6 points of judge accuracy for a different cost class, determinism, and air-gappable deployment.
These aren’t curated single-shot demos. The numbers come from benchmarks/run_locomo.py, which ships with the wheel; you reproduce them on your model, your hardware, your data.
Selective forgetting is a feature, not a bug
The interesting thing about a finite memory is what it makes you decide.
When a tier overflows, you have to drop something. FIFO (drop the oldest) is the lazy default; it’s also wrong, because age and importance are not the same thing. Semvec’s default uses a composite retention score that weighs importance, recency, and access frequency together. The exact weighting is tuned for production workloads in the core; it is not a knob you turn. A frequently-touched older memory survives over a never-touched newer one.
This sounds obvious. It is not the default in most systems, because most systems treat memory as a log. Treating it as a working set, with eviction policy, is what lets the long-term tier stay useful past turn 1,000 instead of slowly turning into noise.
If you want pure recency you can flip a flag and get FIFO. The point is that the default expresses an opinion: what survives is what gets used.
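To make the eviction policy concrete, here is a minimal sketch of a composite retention score next to a FIFO fallback. The Memory fields, the 0.4 / 0.3 / 0.3 weights, and the recency and frequency formulas are all assumptions made for illustration; the article states that Semvec's real weighting is internal and not exposed.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    importance: float       # 0..1, assigned when the memory is written
    last_touched_turn: int  # creation turn, or the turn of the latest retrieval
    access_count: int       # how often retrieval has returned it

def retention_score(m: Memory, now_turn: int) -> float:
    """Hypothetical composite of importance, recency, and access frequency."""
    recency = 1.0 / (1.0 + max(0, now_turn - m.last_touched_turn))
    frequency = m.access_count / (1.0 + m.access_count)
    return 0.4 * m.importance + 0.3 * recency + 0.3 * frequency

def evict(tier: list[Memory], capacity: int, now_turn: int, fifo: bool = False) -> list[Memory]:
    """Return the survivors of an overflowing tier, oldest first."""
    if len(tier) <= capacity:
        return list(tier)
    if fifo:
        return tier[-capacity:]  # pure recency: keep only the newest entries
    ranked = sorted(tier, key=lambda m: retention_score(m, now_turn), reverse=True)
    keep = {id(m) for m in ranked[:capacity]}
    return [m for m in tier if id(m) in keep]
```

With these example weights, an older memory that has been retrieved nine times outscores a newer one that has never been touched, so the two policies keep different survivors from the same tier.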
Phases, anchors, triggers — the dials worth knowing
A few concepts you actually consume in code.
Phases. Every state update returns a phase label, picked automatically from six options: initialization, exploration, convergence, resonance, stability, instability. You don't configure these; you read them and react. Initialization means the state hasn't seen enough signal to be informative: skip the "summarize prior work" prompt, because there is no prior work yet. Exploration means high novelty and weak memory alignment: lean on the LLM's general knowledge, since retrieval has little to add. Stability means a long stretch of converged dialog, which is the moment to checkpoint to disk. The phases are a UX gift, not just an internal signal: they let you gate expensive operations on actual conversational state instead of guessing from turn count.
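One way to react to the phase label is a small gating function. The sketch below uses the six phase names from the text; plan_turn and its three flags are hypothetical application code, and skipping retrieval during exploration is one possible policy rather than something the library prescribes.

```python
PHASES = {
    "initialization", "exploration", "convergence",
    "resonance", "stability", "instability",
}

def plan_turn(phase: str) -> dict[str, bool]:
    """Decide which expensive steps to run for the current turn."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    return {
        # No prior work exists yet during initialization.
        "summarize_prior_work": phase != "initialization",
        # Retrieval has little to add until memory starts to align.
        "inject_retrieved_memories": phase not in {"initialization", "exploration"},
        # A long converged stretch is a natural moment to persist.
        "checkpoint_to_disk": phase == "stability",
    }
```

In an agent loop this would be called with the phase returned by the state update, before the prompt is assembled.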
Anchors. Reference embeddings that pull retrieval toward a known domain. If you’re building a SAP integration assistant, register an anchor on something like "SAP Business One Service Layer OData REST API". Memories that align with that anchor win the tie-break against generic phrases. One anchor per domain you care about.
Resonance triggers. A different shape of bias: boost memories on a specific keyword or vector match. Registering "security review" as a trigger means that the moment the user types that phrase, the state absorbs the input aggressively (β clamped to its minimum) and any memories tagged near that vector get a retrieval boost. Triggers can fire on a substring match or on an embedding-similarity threshold; you pick.
Anchors and triggers compose with max(), not addition, so redundant matches don't double-count.
The reason both exist: anchors are good for domains (“everything ERP-related”), triggers are good for events (“the moment compliance language enters the chat”). You can run both. On mixed-domain workloads, adding anchors lifts top-3 retrieval precision from 86% to 91.7% in the project’s measurements.
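The max() composition is easy to get wrong when reading about it, so here is a small sketch. The trigger dictionary layout, the 0.8 similarity threshold, and the multiplicative way the boost is applied to the base score are assumptions for illustration; only the two firing modes and the use of max() come from the text.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def trigger_fires(user_text, query_vec, trigger, threshold=0.8):
    """Fire on a case-insensitive substring match or on embedding similarity."""
    if trigger["phrase"].lower() in user_text.lower():
        return True
    return cosine(query_vec, trigger["vec"]) >= threshold

def boosted_score(base_score, anchor_boost, trigger_boost):
    """Combine boosts with max(), so two redundant matches count once."""
    return base_score * (1.0 + max(anchor_boost, trigger_boost))
```

With addition, a memory matching both an anchor and a trigger at 0.2 each would get a 40% lift; with max() it gets 20%, the same as a memory that matches only one of them.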
Correcting wrong memories
Here’s a question every memory library eventually has to answer: what happens when something it remembers is wrong?
Pure semantic memory knows nothing about true vs. false, only similar vs. different. “Most recent wins” is a tempting default and a poor one, because the wrong answer is often correlated with something the user mentions all the time, which silently boosts it back into retrieval. Semvec ships five independent mechanisms for correction; pick the cheapest one that solves your problem and compose from there.
- Recency-bias (default). When “newest is usually right” describes your workload, retention scoring handles updates implicitly. It covers more cases than you’d expect — and it breaks in ways that are interesting on their own.
- Per-trigger weights. A thumb on the scale: when a specific phrase fires, a specific memory wins. The hard part isn’t the knob, it’s knowing when a soft signal isn’t enough and a hard pin is.
- Negative attractors. The inverse of an anchor — a region in embedding space that retrieval pushes away from. The previous wrong answer becomes the thing competing memories are scored against. Underused, and the cleanest tool we have for “stop resurfacing this” without erasing it.
- Per-call source / confidence metadata. When ERP, user, and agent all write to the same memory and disagree, retrieval shouldn’t be the layer that picks a winner: your application should. Semvec gives you the hooks; the policy stays where it belongs.
- Hard event-store delete (GDPR Art. 17). Sometimes the answer isn’t “demote.” It’s “this never happened, and here’s an audit trail proving it.” Delete returns a signed certificate the customer can verify offline.
The ordering hides a deeper split: mechanisms 1–3 control what’s retrieved, 4–5 control what’s stored. Conflating those two layers is how memory systems end up with policies nobody can reason about a quarter later.
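As an illustration of the retrieval-side mechanisms, the sketch below demotes memories that fall inside a negative attractor's region without deleting them. The function name, the similarity radius of 0.95, and the flat penalty are hypothetical; the article does not specify how Semvec computes the repulsion.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_with_attractors(query_vec, memory_vec, attractors, radius=0.95, penalty=0.5):
    """Base score is cosine(query, memory). A memory whose similarity to any
    negative attractor reaches `radius` is demoted by `penalty`, not erased."""
    base = cosine(query_vec, memory_vec)
    nearest = max((cosine(memory_vec, a) for a in attractors), default=0.0)
    return base - penalty if nearest >= radius else base
```

Registering the embedding of a known-wrong answer as an attractor lets a slightly less similar but correct memory outrank it, while the wrong one stays in storage for audit.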
Verbatim facts: when semantics is the wrong tool
Semantic memory is great for meaning. It is dangerous for exact values.
If you compress “the next infusion is on 2026-05-15” into an embedding, a clinician later asking “when is the next infusion?” may receive “next month” or, worse, the wrong date. If you compress “transfer 1,247.38 EUR to IBAN DE89 3704 0044 0532 0130 00”, you have just turned a financial instruction into a fuzzy memory that survives cosine similarity but loses every digit that mattered.
Semvec’s LiteralCache exists for exactly this. The compliance pack ships a fact extractor that pulls regex-recognised values — ISO/DE/US dates, EUR/USD/kg/% numerics, UUIDs, IBANs (with mod-97 checksum validation), VAT IDs — out of free text before the input gets folded into the EMA vector. Each fact lands in the literal cache and survives byte-for-byte across compression, persistence, and replay:
from semvec.compliance.extractors import extract_facts

text = "Mein Kontostand ist 1.247,38 € am 15.08.2026"
for fact in extract_facts(text):
    print(fact.kind, fact)
# numeric NumericFact(value=Decimal('1247.38'), unit='EUR', ...)
# date DateFact(value=datetime(2026, 8, 15, tzinfo=UTC), ...)
Two things to notice. The numeric path uses Python’s Decimal, not float: Decimal('0.1') + Decimal('0.2') == Decimal('0.3') holds exactly, with no round trip through binary floating-point. And the IBAN extractor runs the actual mod-97 checksum, so a typo in DE89...0001 doesn't get stored as a valid identifier.
This is the difference between a memory layer you can trust with regulated data and one that occasionally hallucinates a dosage at three in the morning.
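Both points can be checked with the standard library alone. The sketch below is not Semvec's extractor: parse_german_amount and iban_is_valid are hypothetical helpers that show the Decimal behaviour and the standard IBAN mod-97 check (move the first four characters to the end, map letters to 10..35, require a remainder of 1).

```python
from decimal import Decimal

def parse_german_amount(text: str) -> Decimal:
    """'1.247,38' -> Decimal('1247.38'): drop thousands dots, swap the decimal comma."""
    return Decimal(text.replace(".", "").replace(",", "."))

def iban_is_valid(iban: str) -> bool:
    """Checksum only; country-specific length rules are out of scope here."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'..'9' to 0..9 and 'A'..'Z' to 10..35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # exact decimal arithmetic
print(0.1 + 0.2 == 0.3)                                   # binary floats drift
print(iban_is_valid("DE89 3704 0044 0532 0130 00"))
print(iban_is_valid("DE89 3704 0044 0532 0130 01"))       # last digit mistyped
```

The first and third lines print True, the second and fourth print False. Storing the amount as a Decimal and rejecting the mistyped IBAN at extraction time is what keeps a bad value from ever reaching the cache.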
What you can actually build with this
The core surface is small. The things you can stack on it are not.
A drop-in chat proxy. SemvecChatProxy wraps any OpenAI-compatible LLM endpoint and gives you compressed context for free. Same client interface, smaller bill. Useful as a first contact with the library — you can A/B it against your existing setup without rewriting the agent loop, and the proxy returns the compressed-prompt token count alongside the full-history baseline for every call, so you measure your own savings on your own questions.
Skip the LLM call entirely on paraphrases. Semvec’s short_circuit flag fires when the new query is a paraphrase of a recently-answered one. With a configurable threshold (typically 0.85+), the cached answer is returned directly and the LLM round-trip is skipped. On chatty conversational workloads this stacks on top of the token reduction. You're not just paying less per call, you're making fewer calls.
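The idea behind the short circuit can be sketched as a small similarity cache. ParaphraseCache is a hypothetical stand-in written for this example, not the library's short_circuit implementation; the 0.85 default matches the threshold mentioned above, and the bounded deque is an assumed reading of "recently-answered".

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ParaphraseCache:
    def __init__(self, threshold=0.85, max_entries=32):
        self.threshold = threshold
        self._entries = deque(maxlen=max_entries)  # oldest answers fall out first

    def store(self, query_vec, answer):
        self._entries.append((list(query_vec), answer))

    def lookup(self, query_vec):
        """Return the cached answer of the most similar recent query,
        or None if nothing reaches the threshold."""
        best_sim, best_answer = -1.0, None
        for vec, answer in self._entries:
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        return best_answer if best_sim >= self.threshold else None
```

The agent loop calls lookup first and only reaches the LLM on a miss, storing the fresh answer afterwards. A threshold that is too low returns stale answers to questions that merely sound alike, so it is worth tuning on real traffic.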
Persistent coding-agent memory. This is where the design earns its keep. Coding agents lose their minds across session boundaries: the same architectural decision gets re-derived, the same dependency footgun gets re-stepped on, the same bug fix gets re-discovered. Semvec’s coding module ships:
- a LiteralCache for things that should survive verbatim (design decisions, invariants, recurring error patterns with fixes, parsed code structures),
- a CodePointerIndex, a semantic-similarity index over (file_path, signature, intent) triples, so the agent can find the function that handles password reset rather than the specific line of last week's code,
- a NegativeAttractorSet for anti-resonance: register past failures, and the agent's next proposal gets flagged when it resembles them, with severity decay so a year-old failure doesn't drown out today's signal,
- a PromptBuilder that produces a token-budgeted Markdown context block, dropping low-priority sections under tight budgets rather than truncating mid-thought.
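The budget behaviour of the last item, dropping whole sections instead of cutting text, can be sketched as follows. build_context and estimate_tokens are hypothetical helpers; the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the greedy priority order is an assumption about how such a builder might work.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return max(1, len(text) // 4)

def build_context(sections, budget: int) -> str:
    """sections: iterable of (priority, title, body); lower number = more important.
    Sections are added in priority order, and a section that does not fit
    is dropped whole rather than truncated mid-thought."""
    chosen, used = [], 0
    for _priority, title, body in sorted(sections, key=lambda s: s[0]):
        block = f"## {title}\n{body}"
        cost = estimate_tokens(block)
        if used + cost > budget:
            continue
        chosen.append(block)
        used += cost
    return "\n\n".join(chosen)
```

Because a section that does not fit is skipped rather than ending the loop, a short low-priority section can still make it in after a long one was dropped.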
There are full integration guides for Claude Code (MCP server with automatic SessionStart and PreCompact lifecycle hooks, so the agent never has to remember to save) and Cursor (MCP server plus a project rule). Or run the CodingEngine directly from a CI job to ingest transcripts in batch.
Multi-agent coordination. The cortex module runs several agents that share an aggregated state view, vote on proposals, and exchange checksummed state vectors. Available in-process, as a service, or over REST. Worth its own deep-dive — there's one coming, with Neo4j as the storage substrate.
A REST server. pip install "semvec[api]" and semvec serve exposes the full surface over FastAPI. If your stack is not Python, you talk HTTP.
Honest about the parts that matter
A few things worth saying out loud, because they tend to be where memory libraries quietly cheat.
Bring your own embedder, no silent fallback. Anything exposing get_embedding(text) → np.ndarray and get_dimension() → int works — SentenceTransformers (all-MiniLM-L6-v2 for fast, all-mpnet-base-v2 for quality, paraphrase-multilingual-MiniLM-L12-v2 for non-English), OpenAI, ONNX int8 (4× smaller, 2-3× faster on CPU at < 0.5 pp accuracy loss, the right call for serverless and edge), or your own. If the embedder is missing or broken, you get a descriptive RuntimeError. You do not get random-noise pseudo-vectors that look like they're working until your retrieval starts returning garbage. This is the single most important promise a memory library can make and an alarming number of them don't.
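The "fail loudly" contract can be expressed as a structural type plus a guard. This is a sketch of the idea, not Semvec's code: the Embedder protocol mirrors the two method names from the text, checked_embedding is a hypothetical helper, and plain sequences stand in for np.ndarray to keep the example dependency-free.

```python
from typing import Protocol, Sequence, runtime_checkable

@runtime_checkable
class Embedder(Protocol):
    def get_embedding(self, text: str) -> Sequence[float]: ...
    def get_dimension(self) -> int: ...

def checked_embedding(embedder, text: str) -> Sequence[float]:
    """Return the embedding or raise a descriptive RuntimeError.
    There is deliberately no fallback to random or zero vectors."""
    if not isinstance(embedder, Embedder):
        raise RuntimeError("embedder must provide get_embedding(text) and get_dimension()")
    try:
        vec = embedder.get_embedding(text)
    except Exception as exc:
        raise RuntimeError(f"embedder failed: {exc}") from exc
    expected = embedder.get_dimension()
    if len(vec) != expected:
        raise RuntimeError(f"embedding has {len(vec)} dimensions, expected {expected}")
    return vec
```

A missing embedder, a raising embedder, and a dimension mismatch all surface as the same exception type at the call site, which is much easier to debug than retrieval that quietly degrades.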
Determinism is a release gate, not a hope. Every Semvec build is held to a documented parity envelope: token-savings ratio identical to three decimal places, phase-detector decisions bit-identical on identical input, serializer output byte-identical on short haystacks, network resonance within machine epsilon (1.1 × 10⁻¹⁶). Two replays of the same event stream produce the same semantic_state. This is what makes the system viable for audit reconstruction and it’s what most memory libraries quietly don’t have.
Persistence round-trips are checksummed. The serialized state includes an integrity checksum. Tampered or corrupted snapshots raise StateCorruptionError on restore — not "kind of work, sort of, until they don't." Two formats: JSON-safe to_dict() for systems that only speak JSON, and a binary to_bytes() (~2.4× smaller compressed) for cold storage.
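A checksummed round trip of this kind can be sketched with hashlib and json. The envelope layout and the function names are assumptions made for the example, and StateCorruptionError here is a local stand-in class with the name the text uses, not an import from the library.

```python
import hashlib
import json

class StateCorruptionError(Exception):
    """Local stand-in for the error named in the text."""

def dump_state(state: dict) -> str:
    """Serialize deterministically and wrap the payload with its SHA-256."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    checksum = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"payload": payload, "sha256": checksum})

def load_state(blob: str) -> dict:
    """Restore a snapshot, refusing anything whose checksum does not match."""
    try:
        envelope = json.loads(blob)
        payload, expected = envelope["payload"], envelope["sha256"]
    except (ValueError, KeyError, TypeError) as exc:
        raise StateCorruptionError("snapshot is not a valid envelope") from exc
    if not isinstance(payload, str):
        raise StateCorruptionError("snapshot payload has the wrong type")
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != expected:
        raise StateCorruptionError("checksum mismatch")
    return json.loads(payload)
```

A plain hash like this catches corruption and naive edits. Resisting an attacker who can also rewrite the checksum would need a keyed MAC or a signature instead.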
Prompt injection has a semantic-layer answer. A session anchored to a specific domain can be put in QUARANTINE isolation: inputs whose embedding similarity to the domain anchor falls below a threshold get filtered before they reach the LLM. If your medical-information agent is talking about chemotherapy and someone slips in a question about credit-card numbers, the isolation filter catches it at the input boundary, not after the model has already started composing an answer.
The benchmark harness ships with the wheel. benchmarks/run_locomo.py runs the LOCOMO harness against an OpenAI-compatible endpoint with an LLM-as-judge pipeline and a mem0 baseline option. Per-question accuracy by category, token-savings ratio, wall-clock cost per entry. You don’t have to take the README’s word for it — you reproduce the numbers on your own data and your own model.
One wheel, Python 3.10–3.14, prebuilt. Linux x86_64/aarch64, macOS x86_64/arm64, Windows x86_64. pip install semvec. No build chain, no compilers, no "but does it work on Windows."
Built like infrastructure, not like a library
The compliance pack — pip install "semvec[compliance]" — is where Semvec stops being a Python package and starts being something you can put under a regulated workload.
The architecture move worth understanding: the event store is authoritative; the three-tier memory, the EMA vector, and the literal cache are all derived views. A reset of the semantic state doesn’t lose information — replay rebuilds it from the events. A deletion in the event store is the only way to genuinely forget something, and it propagates async to a vector-rebuild worker that brings the running session back into consistency without blocking the request path.
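The relationship between the authoritative log and its derived views can be shown with a toy event store. Everything here is a simplified assumption: the EventStore class is hypothetical, the derived view is a plain per-user list standing in for the tiered memory and EMA vector, and the rebuild runs inline rather than in an async worker.

```python
class EventStore:
    """Append-only log; the only place where forgetting really happens."""

    def __init__(self):
        self._events = []
        self._next_seq = 0

    def append(self, user_id: str, text: str) -> None:
        self._events.append({"seq": self._next_seq, "user": user_id, "text": text})
        self._next_seq += 1

    def forget_user(self, user_id: str) -> int:
        """Remove every event of one user; returns how many were removed."""
        before = len(self._events)
        self._events = [e for e in self._events if e["user"] != user_id]
        return before - len(self._events)

    def replay(self):
        return iter(list(self._events))

def rebuild_view(store: EventStore) -> dict:
    """Derived view: always recomputed from events, never edited directly."""
    view = {}
    for event in store.replay():
        view.setdefault(event["user"], []).append(event["text"])
    return view
```

Resetting the view loses nothing because replay reproduces it, and a deletion becomes visible in every view the next time it is rebuilt.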
What you get on top:
- Append-only event store with deterministic replay. Every memory mutation is reconstructable from the audit log.
- 30-day retention sweeper as a cron-friendly job that purges anything past retention_days and writes an audit record per affected user. Idempotent.
- GDPR Art. 17 forget with a signed DeletionCertificate (RSA-PSS-SHA256, 256-byte signatures, RSA chosen over Ed25519 for HSM and compliance-tooling compatibility). Customers can verify the certificate offline against the wheel-embedded operator public key.
- HMAC request signing in the AWS-SigV4 style: (METHOD, PATH, SHA256(body), TIMESTAMP, NONCE) canonical, HMAC-SHA256, constant-time verify, replay defence via a nonce cache.
- RS256 user JWTs: per-user public key registered server-side, private key never leaves the client. The server cannot forge tokens.
- Feature flags everywhere. Every compliance capability ships behind a SEMVEC_ENABLE_* environment variable, defaulting to off. An existing deployment that imports semvec does not pick up new behaviour by accident.
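The request-signing scheme in the list above maps directly onto Python's hmac module. The sketch below follows the five canonical fields from the text; the newline-joined canonical form, the 300-second clock-skew window, and the in-memory nonce cache are assumptions for illustration, not Semvec's wire format.

```python
import hashlib
import hmac
import time

def canonical_request(method: str, path: str, body: bytes, timestamp: str, nonce: str) -> bytes:
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, body_hash, timestamp, nonce]).encode("utf-8")

def sign(secret: bytes, method: str, path: str, body: bytes, timestamp: str, nonce: str) -> str:
    message = canonical_request(method, path, body, timestamp, nonce)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

class Verifier:
    def __init__(self, secret: bytes, max_skew_seconds: float = 300.0):
        self.secret = secret
        self.max_skew = max_skew_seconds
        self._seen_nonces = set()  # a real cache would expire old entries

    def verify(self, method, path, body, timestamp, nonce, signature, now=None) -> bool:
        now = time.time() if now is None else now
        try:
            ts = float(timestamp)
        except ValueError:
            return False
        if abs(now - ts) > self.max_skew:
            return False  # stale or future-dated request
        expected = sign(self.secret, method, path, body, timestamp, nonce)
        # compare_digest runs in constant time to avoid leaking a timing signal.
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return False
        if nonce in self._seen_nonces:
            return False  # replay of an already accepted request
        self._seen_nonces.add(nonce)
        return True
```

The nonce is recorded only after the signature checks out, so unauthenticated traffic cannot fill the cache.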
This is the boring infrastructure you need for the moment a B2B prospect’s procurement team starts asking real questions. It’s nice that it’s already there.
Getting started, in the smallest unit of useful
pip install semvec

from semvec import SemvecState, SemvecConfig
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional embeddings

class STEmbedder:
    def get_embedding(self, text):
        return model.encode(text)

    def get_dimension(self):
        return model.get_sentence_embedding_dimension()

embedder = STEmbedder()
state = SemvecState(config=SemvecConfig(dimension=768))

# Each turn: update the state, get back metrics, optionally retrieve.
user_text = "How do we rotate the API keys?"  # placeholder user turn
result = state.update(embedder.get_embedding(user_text), user_text)
print(result["phase"], result["fsm"], result["topic_switch"])

# When you're about to call your LLM:
query_vec = embedder.get_embedding(user_text)
relevant = state.memory.get_relevant_memories(query_vec, top_k=5)
That’s the floor. The ceiling is the rest of the docs — REST endpoints, Cortex coordination, the LiteralCache, lifecycle hooks for Claude Code and Cursor, the compliance pack.
When this is the right tool
Use Semvec when at least two of these are true:
- Your conversations or sessions go past a few dozen turns.
- You’re paying per input token and the bill is starting to be a planning problem.
- You need memory to survive across sessions, not just within one.
- You care about which old context comes back, not just some old context.
- You handle exact values — dates, amounts, identifiers — that must round-trip without a fuzzy approximation.
- You’ll need to explain to a procurement team how forget-me requests are handled.
Use something simpler — straight retrieval on a vector DB, or just a sliding window — if your sessions are short, your conversations are stateless, and your cost ceiling is comfortable. Semvec is built for the workloads where memory becomes infrastructure, not where it’s a nice-to-have.

Closing
The thing I keep coming back to is that “more context” is a local maximum. The interesting move isn’t bigger windows or longer transcripts: it’s a memory architecture where the cost of remembering doesn’t scale with the amount you remember, where wrong information has five distinct ways to be corrected, where exact values survive compression byte-for-byte, and where every state in the system can be reconstructed from an event log on demand.
Semvec is one shape of that bet. Docs at https://semvec-docs.pages.dev, package on PyPI (https://pypi.org/project/semvec/), pricing and tiers at semvec.io.
If you build something with it, I’d genuinely like to hear about it.
Semvec is developed by Versino PsiOmega GmbH. Four non-provisional patent applications are pending: EP 25 188 105, EP 26 160 795, US 19/269,195, US 19/550,466. Until grant, features describe claims of pending applications, not enforceable exclusive rights.
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.