Indirect prompt injection in tool-calling agents: detection shape and the first week of tuning

By AutoCypher · 31 May 2026

By 2026 the prompt-injection conversation has moved off the chat box. The interesting attack surface is the agent that reads a Jira ticket, fetches a linked Confluence page, queries Snowflake, and posts a summary to Slack — all in one user turn, all stitched together by a Model Context Protocol server or an equivalent function-calling harness. The model has no privilege boundary between the instructions you gave it and the tokens it just pulled out of a customer-submitted PDF. That is the entire problem in one sentence, and it is the reason the OWASP LLM Top 10 has kept LLM01 (Prompt Injection) at the top through every revision.

What changed in the last eighteen months is not the attack. Greshake et al. described indirect injection cleanly back in 2023. What changed is the blast radius. An agent with read-only RAG over an internal wiki is a containment problem. An agent with fetch_url, read_email, query_warehouse, and post_message in the same tool registry is a data exfiltration vector that runs inside your trust boundary, authenticated as a service principal that probably has more scope than anyone audited.

This piece is about what defending that looks like operationally — what to log, what to alert on, where the first round of false positives will come from, and which environment assumptions change the answer.

The mechanism, briefly

An LLM does not distinguish between a system prompt, a user message, and the body of a retrieved document. They are all tokens. If an attacker can land tokens in the model’s context (by getting their text indexed into the RAG corpus, by sending an email the agent will summarize, by editing a web page the agent will browse, or by planting a comment in a code review the agent will read), those tokens carry the same weight as your instructions. “Ignore previous instructions” is the toy version. The grown-up version is a paragraph of plausible-looking English that nudges the agent into calling send_email with the contents of the prior tool’s response as the body.

The defenses fall into three uncomfortable buckets. Input-side filtering, which catches the dumb attacks and misses the rest. Output-side policy, where you constrain what tools the agent can invoke and with what arguments. And telemetry, which is what most shops actually need to invest in first because the other two will fail and you need to know when.

Guardrail vendors will tell you their classifier catches 98% of injections. Maybe. The 2% that gets through is the one that matters, and the 98% number is measured against public benchmarks the attackers already trained around. Treat input filtering as a speed bump, not a control.

What to log

If you are running an agent framework in production and you are not emitting structured logs at the tool-call boundary, fix that first. Nothing else in this post matters until you can answer: for any given response the agent produced, which documents fed the context, and which tool calls fired with what arguments. The minimum useful schema per agent turn:

agent.session_id and agent.turn_id so you can reconstruct the chain
agent.context.sources[] — every URI, document ID, ticket ID, or row identifier that contributed tokens to the prompt
agent.tool_call.name, agent.tool_call.args (redacted where they contain secrets), and a hash of the args
agent.tool_call.parent_source — which source document was being processed when this tool call decision was made, if your harness can attribute it
agent.identity — the service principal the agent is acting as, not the human who initiated the session

That last one trips people up. The agent runs as a robot account. Your existing SIEM correlation rules are written around human user identity. Splunk’s CIM user field will get populated with the service principal, and every downstream join breaks unless you also carry the initiating human identity through. Add a custom field for agent.initiator and put it in the data model; otherwise your insider-threat analytics are going to attribute every agent action to svc-llm-prod and tell you nothing.
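A minimal sketch of one such record in Python, assuming JSON events shipped to the SIEM. The field names follow the schema above plus agent.initiator. The helper names (redact, args_hash, build_tool_call_event), the list of secret-looking keys, and the sample values are illustrative assumptions, not part of any agent framework.

```python
import hashlib
import json
from typing import Any, Optional

# Assumption: argument keys treated as secrets; adjust to your own tools.
SECRET_KEYS = {"api_key", "password", "token", "authorization"}


def redact(args: dict[str, Any]) -> dict[str, Any]:
    """Replace values of secret-looking top-level keys with a marker."""
    return {k: "[REDACTED]" if k.lower() in SECRET_KEYS else v
            for k, v in args.items()}


def args_hash(args: dict[str, Any]) -> str:
    """SHA-256 of a canonical JSON form, so equal args hash identically."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_tool_call_event(session_id: str, turn_id: int, sources: list[str],
                          tool_name: str, args: dict[str, Any],
                          parent_source: Optional[str],
                          identity: str, initiator: str) -> dict[str, Any]:
    """Assemble one structured log record for a single tool call."""
    return {
        "agent.session_id": session_id,
        "agent.turn_id": turn_id,
        "agent.context.sources": sources,
        "agent.tool_call.name": tool_name,
        "agent.tool_call.args": redact(args),
        # Hash is taken over the unredacted args so retries compare equal.
        "agent.tool_call.args_hash": args_hash(args),
        "agent.tool_call.parent_source": parent_source,
        "agent.identity": identity,    # the service principal
        "agent.initiator": initiator,  # the human who started the session
    }


event = build_tool_call_event(
    session_id="s-001", turn_id=3,
    sources=["https://corp.example.com/wiki/runbook"],
    tool_name="post_message",
    args={"channel": "#ops", "token": "abc123"},
    parent_source="https://corp.example.com/wiki/runbook",
    identity="svc-llm-prod", initiator="alice@corp.example.com",
)
print(json.dumps(event))
```

If tool arguments can contain guessable secrets, a keyed hash (HMAC) is the safer choice for the args hash, since a plain SHA-256 of a short secret can be brute-forced.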

The detection that actually works

The single highest-signal detection I would build first is not a content classifier. It is a behavioral one: flag turns where the set of tool calls the agent makes is inconsistent with the user’s stated intent and the source provenance of the context.

In Splunk terms, something like:

index=agent_telemetry sourcetype=agent:tool_call
| stats values(tool_call.name) as tools
        values(context.sources) as sources
        by session_id turn_id
| eval external_source=if(match(mvjoin(sources,","),"https?://(?!corp\.example\.com(?:[/:,]|$))"),1,0)
| eval has_egress=if(match(mvjoin(tools,","),"(send_email|post_webhook|fetch_url|create_ticket)"),1,0)
| where external_source=1 AND has_egress=1

That catches the structural shape of indirect injection: the agent read something from outside the trust boundary, and in the same turn it invoked a tool capable of moving data outward. It will not catch the clever stuff. It will catch enough of the obvious stuff to be useful, and the obvious stuff is most of what shows up in real incidents.
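For teams not on Splunk, the same rule can be sketched in Python. This version parses each URI's hostname and compares it exactly, so a lookalike such as corp.example.com.evil.net counts as external. The internal host and the egress tool names mirror the query above; the function names are hypothetical.

```python
from urllib.parse import urlsplit

INTERNAL_HOSTS = {"corp.example.com"}  # assumption: mirrors the query above
EGRESS_TOOLS = {"send_email", "post_webhook", "fetch_url", "create_ticket"}


def is_external(source: str) -> bool:
    """True for http(s) URIs whose hostname is not an internal host."""
    try:
        parts = urlsplit(source)
    except ValueError:
        return True  # unparseable URI: fail closed and treat as external
    if parts.scheme not in ("http", "https"):
        return False  # ticket IDs, document IDs, etc. are not judged here
    return (parts.hostname or "").lower() not in INTERNAL_HOSTS


def flag_turn(sources: list[str], tools: list[str]) -> bool:
    """Flag a turn that read an external source AND called an egress tool."""
    read_external = any(is_external(s) for s in sources)
    has_egress = any(t in EGRESS_TOOLS for t in tools)
    return read_external and has_egress
```

Like the query, this only judges sources that are URLs. A customer-submitted PDF logged under a ticket ID passes as internal unless you add your own trust label to the source record.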

Volume expectation, in a shop with maybe a hundred internal agent users and a tool registry of a dozen functions: you should expect this to fire in the low tens per day before tuning, dropping into single digits once you carve out the legitimate “summarize this external page and send it to me” workflow that every executive will inevitably ask for. If you are seeing hundreds of hits a day, your tool registry is too permissive — that is the finding, not the alert noise.

Where the false positives come from

Three sources, roughly in order of how much of your tuning week they will eat.

The first is the legitimate cross-system workflow. “Read this customer email, look up their account in the warehouse, draft a reply.” Structurally identical to an exfil chain. The fix is not a smarter detection — it is allowlisting specific tool-call sequences against specific user roles. Customer support agents get the email→warehouse→draft chain. Engineering agents do not. This is CM-7 (least functionality) and AC-3 (access enforcement) doing their actual jobs, and yes, it means maintaining a policy that someone has to update every time the product team ships a new agent capability.
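A sketch of that allowlist, assuming the policy is keyed by role and matches the exact ordered tool sequence of a turn. The role names and the draft_reply tool are hypothetical; a real policy would live in config and go through change control.

```python
# Hypothetical policy: role -> set of allowed ordered tool-call sequences.
ALLOWED_CHAINS: dict[str, set[tuple[str, ...]]] = {
    "customer_support": {
        ("read_email", "query_warehouse", "draft_reply"),
    },
    "engineering": set(),  # no cross-system chains allowlisted
}


def is_allowlisted(role: str, tool_sequence: list[str]) -> bool:
    """True if this exact ordered sequence is allowed for the role."""
    return tuple(tool_sequence) in ALLOWED_CHAINS.get(role, set())


def should_alert(role: str, tool_sequence: list[str], structurally_flagged: bool) -> bool:
    """Suppress the structural detection only for allowlisted chains."""
    return structurally_flagged and not is_allowlisted(role, tool_sequence)
```

Exact-sequence matching is deliberately strict. An injected extra call, say a send_email appended to the support chain, no longer matches and the alert still fires.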

The second is the agent that retries. When a tool call fails — timeout, 429, malformed args — most harnesses will retry, sometimes with a slightly different prompt. Your detection sees the same external-source + egress-tool pattern fire twice in two seconds and double-counts. Deduplicate on (session_id, tool_call.args_hash) within a short window. The LangGraph default retry behavior in particular will generate clusters of near-identical events; if you are on a different framework, check what its retry semantics actually do versus what the docs say (they often disagree, and the docs lag the code by a release or two).
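The deduplication step can be sketched as follows, assuming events arrive in timestamp order and the window is a few seconds. The class name and the five-second default are assumptions. The sketch never evicts old keys, so a production version would need to expire them.

```python
class AlertDeduplicator:
    """Suppress repeats of (session_id, args_hash) inside a short window."""

    def __init__(self, window_seconds: float = 5.0):
        self.window = window_seconds
        self._last_seen: dict[tuple[str, str], float] = {}

    def is_duplicate(self, session_id: str, args_hash: str, ts: float) -> bool:
        """Return True if the same key was seen within the window before ts."""
        key = (session_id, args_hash)
        last = self._last_seen.get(key)
        # Always refresh the timestamp, so a sustained retry storm
        # keeps being suppressed rather than re-alerting mid-burst.
        self._last_seen[key] = ts
        return last is not None and (ts - last) <= self.window


dedup = AlertDeduplicator(window_seconds=5.0)
first = dedup.is_duplicate("s-001", "abc", ts=100.0)   # first sighting
retry = dedup.is_duplicate("s-001", "abc", ts=102.0)   # retry two seconds later
```

Here `first` is False and `retry` is True, so only the first event would raise an alert.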

The third is the long-running agent session where context accumulates. A session that has been running for forty turns has tokens in it from sources the user has long since forgotten. The agent might invoke a tool whose arguments derive from a document loaded twenty turns ago. Your parent_source attribution will be wrong, or absent, and the detection will look like a false positive because the “current” source is benign. The honest answer is that long-lived agent sessions are hard to defend and you should cap session length aggressively. Thirty minutes is generous. The product team will hate this. The product team is wrong.
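Enforcing the cap is a small check in the harness before each turn. A sketch, assuming the harness tracks session start time and turn count; the turn budget of 20 is an illustrative assumption, while the thirty-minute limit comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLimits:
    max_age_seconds: float = 30 * 60  # the thirty-minute cap suggested above
    max_turns: int = 20               # assumption: illustrative turn budget


DEFAULT_LIMITS = SessionLimits()


def must_end_session(started_at: float, now: float, turn_count: int,
                     limits: SessionLimits = DEFAULT_LIMITS) -> bool:
    """True once the session exceeds either its age or its turn budget."""
    age = now - started_at
    return age >= limits.max_age_seconds or turn_count >= limits.max_turns
```

Ending the session discards the accumulated context, so a document loaded early can no longer steer a tool call many turns later.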

Environment assumptions that change the answer

A RAG agent over a curated internal corpus with no write tools and no outbound fetch is a different risk profile from a browser-using agent loose on the open web, and the controls should not be the same. Some of the relevant axes:

Tool registry shape. Read-only versus read-write. Internal-only versus outbound-capable. The presence of any tool that can serialize content into an attacker-observable channel (email, webhook, DNS query, even an external image fetch) is the bright line. A pure-read agent is a confidentiality problem at worst; an agent with one outbound write tool is a full exfiltration problem.
Corpus trust level. Documents authored only by employees behind SSO are a different threat than a corpus that ingests customer-submitted content, support tickets, or anything scraped from the web. The latter requires treating every retrieved chunk as untrusted input and applying SI-10 (information input validation) at the chunk level, not the query level.
Model and harness. Frontier models from the major labs have meaningfully better resistance to obvious injection than open-weights models fine-tuned for instruction following without much safety training. This does not make them safe. It changes which attacks land. If you are running a self-hosted Llama variant for cost or sovereignty reasons, your detection thresholds should be tighter and your tool registry should be smaller.
Authorization model for tool calls. If the agent’s tools all run as the agent’s service principal, you are one injection away from full blast radius. If each tool call is re-authorized against the initiating user’s permissions — what some shops are calling “on-behalf-of” agent identity — the blast radius collapses to what that user could have done anyway. This is the single highest-leverage architectural change available, and most production deployments have not made it because it is expensive to retrofit.

That last point is where I would push hardest in an architecture review. Detections matter, but a containment architecture where the agent literally cannot exceed the initiating user’s authorization is worth more than any classifier.
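A sketch of the on-behalf-of check, assuming the harness knows the initiating user and can look up that user's tool scopes. The in-memory USER_SCOPES map, the user names, and the function names are hypothetical; in practice the lookup would hit your identity provider or policy engine.

```python
from typing import Any, Callable


class ToolCallDenied(Exception):
    """Raised when the initiating user lacks permission for a tool."""


# Hypothetical permission store: initiating user -> tools they may use.
USER_SCOPES: dict[str, set[str]] = {
    "alice@corp.example.com": {"query_warehouse", "post_message"},
    "bob@corp.example.com": {"post_message"},
}


def authorize_tool_call(initiator: str, tool_name: str) -> None:
    """Check the tool against the human's scopes, not the service principal's."""
    if tool_name not in USER_SCOPES.get(initiator, set()):
        raise ToolCallDenied(f"{initiator} may not call {tool_name}")


def invoke_tool(initiator: str, tool_name: str,
                tool_fn: Callable[..., Any], **kwargs: Any) -> Any:
    """Re-authorize on every call, then run the tool."""
    authorize_tool_call(initiator, tool_name)
    return tool_fn(**kwargs)
```

The check runs on every call, not once per session, so an injection that talks the model into a new tool mid-session still hits the initiating user's limits.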

Control mapping

The NIST SP 800-53 controls that actually apply, in the order they tend to matter:

Control	What it covers here
AC-3, AC-4	Tool-call access enforcement and information flow between context sources and tools
SI-10	Treating retrieved content as untrusted input requiring validation
AU-2, AU-12	The tool-call telemetry above; without this the rest is theatrical
CM-7	Least-functionality applied to the tool registry per agent role
SC-7	Boundary protection between the agent’s outbound tools and the network
RA-3	Risk assessment of each new tool added to the registry
SA-8	Security engineering principles applied to agent harness design

NIST AI 600-1 (the GenAI profile of the AI RMF) is the document to point at for governance conversations. It is not prescriptive enough to write detections from, but it gives you the vocabulary to argue for resourcing.

What most teams get wrong before they get it right

The most common early mistake is investing in a prompt-injection classifier as the primary control and treating tool-call telemetry as a future project. That is exactly backwards. Classifiers degrade silently as attackers iterate. Telemetry, once it exists, lets you investigate the incidents the classifier missed and feed the misses back into both your filter and your tool-registry policy.

The second most common mistake is letting the agent run as a service principal with broad scope because “it makes the demo work.” Every demo I have ever seen ships to production with the same scope. Pin it down at design time or accept that you are going to pin it down during an incident.

The third is treating agent sessions as ephemeral chat sessions when they are actually long-lived automation runs with accumulating state. Apply the same controls you would apply to any other service-to-service automation: identity, scope, audit, rate limit, kill switch. The fact that the orchestration layer is a language model is interesting from an attack-surface standpoint and irrelevant from a control standpoint. You still need the kill switch.

Defending these systems is not a new discipline. It is the old discipline, applied to a component that happens to make non-deterministic decisions about which tool to call next. The agent is a confused deputy with a thesaurus. Treat it accordingly.