Agent Memory Poisoning Survives the Session. The Detection Lives in the Write Path Nobody Logs

By AutoCypher · 0 days ago 30 Jun 2026

Single-turn prompt injection has one mercy: it usually dies when the context window clears. The malicious instruction lives for a turn, maybe a conversation, and then the session resets and the agent forgets it ever happened — unless it got copied into some other persistent surface first, a conversation summary, a saved transcript, a generated file. Long-term memory makes that persistence a first-class feature rather than an accident. Memory poisoning removes that mercy. The whole reason an agentic system is worth running — that it remembers your preferences, your past tasks, the shape of your repo, the fact that you always want the diff before the commit — is the exact mechanism that turns one successful injection into a standing instruction. Write the payload into long-term memory once, and it executes every time a future query semantically retrieves it. OWASP made the split official on December 9, 2025, breaking Memory & Context Poisoning out as ASI06 in the first Top 10 for Agentic Applications. The reason it earned its own slot is the one that should worry you: the controls you built for prompt injection sit at the wrong layer to catch a payload that already lives inside your trusted state.

That last point is the whole game, so sit with it. Your prompt-boundary input filter inspects what the user sends this turn. Your output guard inspects what the model emits. Neither of them is looking at the agent’s memory store, because by the time a poisoned memory is retrieved and folded into context, it isn’t untrusted input anymore. It’s the agent’s own recollection. It crossed the trust boundary at write time, days or weeks before it ever influenced an action.

How the payload gets in, and why it waits

The academic work here is no longer speculative. The MINJA attack (arXiv:2503.03704, “Memory Injection Attacks on LLM Agents via Query-Only Interaction”) demonstrates that an attacker who can only send the agent ordinary queries — no access to the memory backend, no privileged API — can get malicious records committed to the memory bank at over 95% injection success across every agent and dataset tested. Whether the planted record then actually hijacks the target query is more variable: attack success runs past 70% on most datasets but swings by setting, from roughly 57% on MIMIC-III to about 90% on eICU. Benign-task accuracy barely moves on most datasets — under 2% degradation — with one exception, MMLU, which drops about 10% because the malicious demonstrations crowd out the benign in-context examples, not because of the retrieval itself. These are the paper’s own figures, from controlled evaluation rather than a production deployment. That near-invisibility on benign tasks is the nasty part. The agent keeps working normally. Nothing looks broken. A more recent line of work, MemoryGraft (arXiv:2512.16962), goes after experience-style long-term memory rather than the factual store: it implants “successful experiences” so the agent imitates a poisoned reasoning trace the next time a similar task shows up, exploiting the semantic-imitation heuristic that makes experience replay useful in the first place.

Two properties make this different from anything in your existing injection playbook.

The first is temporal decoupling. The write and the trigger are separated in time, sometimes by weeks. Poison planted on a Tuesday fires when some unrelated future query happens to retrieve it by semantic similarity. Your IR timeline, which assumes the malicious input and the malicious action are close together, has nothing to anchor on.

The second is that the poison rides in through content the agent was supposed to read. An agent doing real work ingests tool outputs, fetched web pages, email bodies, retrieved document chunks, the JSON a third-party API hands back. Any of those can carry text the agent distills into a durable “fact” or “preference.” The user never typed the instruction. The agent learned it, from a source it had every reason to consult.

What you actually have to log

Here is where most teams discover they have no detection because they have no data. The memory store — pgvector on Postgres, Qdrant, a Redis vector index, a managed mem0 or Letta deployment — is treated as application-internal state. It is not a security event source. There is no agent:memory:write sourcetype in your Splunk index because the application never emits one. So before you write a single detection, you are building the event.

What that event needs, at minimum: the memory ID, the session and user scope the write came from, a content hash, and the one field that does all the work — provenance. Where did this memory originate? Tag every commit as user_input, tool_output, retrieved_doc, or agent_inferred — and carry the source’s trust level alongside the type, because a tool_output distilled from an internal read-only database is a different risk than one lifted from a fetched web page or an email body, and the low-trust end — anything whose integrity you can’t assert at write time, external or a compromised internal source — is where the poison actually comes from. (agent_inferred is the slippery tag: be strict about what counts as the agent’s own distillation versus a near-verbatim lift from a tool result, or it stops meaning anything.) mem0 and Letta will happily store the content and its embedding, and both can carry metadata and tags if your orchestration passes them; what neither does out of the box is automatically classify “this fact was distilled from a tool result versus typed by the human.” You are adding that plumbing yourself, at the add()/persist boundary, and that work is the part nobody scopes into the project. (The frameworks were built to make memory easy to write, not easy to audit. As of the 2025 releases anyway; could have moved.)

Once the event exists, the naive detection writes itself, and it’s wrong:

sourcetype=agent:memory:write
  mem_text IN (imperative trigger patterns: "always", "from now on",
               "whenever you", "ignore previous", "instead of")

Run that on a platform with, say, 4,000 daily active agent users each committing six to twelve memory entries a day, and you’re scanning 24,000 to 48,000 writes a day. (Numbers illustrative — the real volume depends entirely on how aggressively your platform writes preferences and summaries.) Imperative phrasing hits maybe 10–15% of them, because that is what a preference looks like. “From now on reply in metric.” “Always show me the test output first.” Those are legitimate, desirable memory writes. You’ve just built an alert that fires several thousand times a day on your product working as designed. Dead on arrival. And the false-positive flood is only half the problem — the rule is trivially evaded, too. Spell the trigger with Unicode lookalikes or tuck it inside encoded text and it sails straight past, so even the hits it does land are the least sophisticated attackers.

The first round of tuning

The signal was never the imperative form. It’s the provenance. Filter to writes where mem_source is tool_output or retrieved_doc — content the agent ingested rather than the user authored — and the imperative-pattern rule drops from thousands of hits to something a human can look at.

But it still over-fires, and you need to understand why before you trust it. Agents legitimately distill durable facts from tool outputs all the time. “This repo uses pnpm” learned from reading package.json is a tool_output-derived memory write, it’s completely benign, and it’s the feature working. So provenance narrows the field but doesn’t close it. The thing that separates a poisoned write from a useful one isn’t the source and isn’t the imperative mood by itself; it’s content that is instruction-shaped and aimed at the agent’s future behavior rather than a fact about the world. Detecting that reliably is itself an LLM classification call, which means latency and cost on the write path — and a synchronous classifier in front of every write is a new availability dependency, so in practice you run it out of band: let the write land, classify asynchronously, and quarantine or flag whatever scores high. That asynchrony buys availability at the price of a window: the memory is live and retrievable the instant it lands, so a poison that triggers fast enough fires before the classifier ever scores it — temporal decoupling makes that window survivable in the common case, not in every one. It carries its own false-positive rate that you now have to tune against, and — because it has to read the untrusted content in order to judge it — the classifier is itself a prompt-injection target, an arms race nested inside the one you started with. There’s no clean regex for “this sentence is trying to change what you do later.” Expect the first tuning pass to land you somewhere imperfect, with a residual FP rate you manage rather than eliminate, and a small allowlist of known-good distillation patterns that grows for a few weeks before it stabilizes. One class is worth pulling out of the async model entirely: for the highest-risk writes — behavioral instructions aimed at the agent (“always,” “never,” “ignore,” “send,” “delete”) and anything privilege-affecting, especially when the provenance is tool_output or retrieved_doc — stage or quarantine the write before it goes live rather than letting it land and scoring it after. You spend a little latency on a small slice of writes to close the fire-before-classified window exactly where it would do the most damage.

Then there’s the read side, which is where you catch the trigger even if you missed the write. Instrument retrieval: when a memory entry is recalled into context, log the mem_id and correlate it against the action the agent takes next. A specific memory that keeps preceding anomalous tool calls — an exfil-shaped API request, a privilege-touching operation, an outbound message to a destination the user never named — is a far stronger signal than anything you’ll get inspecting writes in isolation. This is the detection that actually survives contact with a clever attacker, because it doesn’t depend on the poison looking suspicious. It depends on what the poison makes the agent do.

Retention is a real budget fight here

Temporal decoupling collides directly with your indexing costs. Poison at day zero, semantic trigger at day 45. If your agent:memory:write source sits in hot/warm for 30 days and ages out by 45, then on the day the bad action finally fires you have no provenance trail for the memory that caused it. The forensic question — where did this belief come from, and through which session — is unanswerable because the write event is gone.

So retention on this specific source has to outlast your longest plausible dwell time, and that fights you, because memory-write volume is high and most of it is boring. You are paying to keep tens of thousands of mostly-useless events a day, warm, for longer than your default policy, against the chance that a handful of them are the origin story for an incident you haven’t detected yet. That’s a defensible spend, but it is a spend, and it won’t survive a budget review unless someone can articulate exactly this scenario. And the cost isn’t only the storage line: holding memory-write content warm for months enlarges your PII and data-subject-request surface at the same time, so hash or tokenize the content you keep and hold the long-lived copy down to the fields you genuinely need for attribution — IDs, provenance, source and trust level, hashes — not the raw remembered text.

Where the environment changes the answer

Blast radius is entirely a function of memory scope, and this is the variable that flips the severity.

Per-user, per-session scoped memory contains the damage: a poisoned entry hurts one user’s agent and nobody else’s. A shared semantic store across tenants — one fact base every customer’s agent reads from and writes to — turns a single injection into a cross-tenant contamination event, and now it’s a confidentiality problem stacked on the integrity one. Per-tenant, per-session isolation should be the default production architecture here, and where you already have it, keeping it that way is the cheap structural win. The genuine worst case is a self-updating knowledge base the agent both retrieves from and writes back to without mediation, because the poison compounds: each retrieval that acts on the bad fact can generate a new write that reinforces it. At the other end, a single-user local coding assistant with project memory has a tiny blast radius and, not coincidentally, zero SOC visibility, because nobody is shipping that agent’s memory writes to a SIEM at all.

Concern	NIST 800-53 control	What it actually means here
Logging memory writes with provenance	AU-2, AU-3	The `agent:memory:write` event with a `mem_source` field is the prerequisite for everything else
Retention past dwell time	AU-11	Keep the write log warm longer than your longest plausible decoupling window
Monitoring writes and retrieval-to-action	SI-4	Correlate recalled `mem_id` against the next tool call, not just inspect writes
Mediating ingested content into memory	AC-4, SI-10	Tool-output and retrieved-doc content should not flow into trusted memory unmediated
Scoping who/what can write memory	AC-3, AC-6	Per-tenant isolation is the difference between one victim and all of them
Memory integrity	SI-7	Content hashing, plus append-only or signed write logs — a checksum on mutable rows isn’t tamper-evident on its own
Third-party memory frameworks, shared corpora, MCP (Model Context Protocol) tools feeding memory	SR-3, SR-11	The memory layer is supply chain; so is every tool whose output becomes a remembered fact
Protecting remembered content at rest	SC-28	Memory stores hold sensitive remembered facts and often PII; encrypt/tokenize at rest and keep the long-lived copy hashed
Privacy of long-lived memory-write logs	PT-2, PT-3 (privacy)	Months of warm memory-write content is a PII and data-subject-request surface; minimize and hash what you retain

Those mappings are a suggested alignment for reasoning about coverage, not official NIST guidance — 800-53 doesn’t name agent memory, so read the right-hand column as the operative part.

The uncomfortable conclusion is that you can’t filter your way out of this at the prompt boundary, because the payload is inside the trust boundary by the time it matters. Memory isn’t a side feature you can disable; it’s the product. So the move is to stop treating the memory store as opaque application state and start treating it as what it is: a logged, access-scoped, provenance-tagged data store whose write and read events belong in the SIEM, with retention that outlasts your dwell time. Do that, and a poisoned memory becomes an event you can hunt. Skip it, and one good injection becomes a standing instruction you find weeks later, by accident, after it has already done whatever it was planted to do.

How the payload gets in, and why it waits

What you actually have to log

The first round of tuning

Retention is a real budget fight here

Where the environment changes the answer

Sources