Indirect Prompt Injection at the Tool-Response Boundary: What the AI Gateway Actually Sees
The interesting AI security problem in 2026 isn’t the chat box anymore. It’s the second turn. The first turn is the user, who you can at least claim to authenticate and rate-limit. The second turn is a tool response — a Confluence page, a Jira comment, an Outlook draft, an S3 object, a row in a vendor’s MCP server — and that content gets fed back into the model’s context as if it were trusted narration of the world. It isn’t. It’s attacker-reachable text. And in most agent stacks I’ve looked at, the gateway treats it the same as the system prompt by the time the next model call goes out.
That’s the shape of the problem. Indirect prompt injection isn’t a clever jailbreak; it’s a trust-boundary failure dressed up as a content-safety failure. Treating it as a content problem (run a classifier on the user message) is why the existing guardrail vendors keep missing it: the malicious instruction isn’t in the user message at all.
Where the boundary actually lives
In a typical 2026 stack — LiteLLM or a similar proxy in front, an MCP host orchestrating tools, a model endpoint behind that — the request lifecycle looks like a loop. The model emits a tool call. The host executes it. The tool’s response gets appended to the conversation as a tool role message. The model then reasons over that combined context and may emit another tool call, or a final assistant message. The loop continues until the model decides it’s done or the orchestrator hits a turn limit.
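To make the loop concrete, here is a minimal sketch of it in Python. The message shape, the `tool_call` key, and the `ModelFn` / `ToolFn` types are simplified assumptions for illustration, not the interface of LiteLLM, MCP, or any vendor SDK.

```python
from typing import Callable

# Hypothetical, simplified shapes: a "model" is any callable that takes the
# message list and returns an assistant message, optionally with a tool call.
Message = dict
ModelFn = Callable[[list], Message]
ToolFn = Callable[[dict], str]


def run_agent_loop(model: ModelFn, tools: dict, messages: list,
                   max_turns: int = 10) -> list:
    """Run the tool-call loop until a final answer or the turn limit."""
    for _ in range(max_turns):
        reply = model(messages)            # step 1: model emits a tool call or a final message
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:
            break                          # final assistant message: the loop ends
        output = tools[call["name"]](call["arguments"])  # step 2: host executes the tool
        # Step 3: the tool output is appended verbatim as a tool-role message.
        # Nothing here marks it as less trusted than the turns before it.
        messages.append({"role": "tool", "name": call["name"], "content": output})
    return messages
```

Everything the tool returns lands in `messages` next to the user and system turns, and the next call to `model` sees it all as one context.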
The injection lives in step three, where the tool’s response is appended to the conversation. Anything reachable by the tool (any wiki page, any ticket, any email body, any web fetch result, any vector-store chunk) becomes instruction-shaped text inside the model’s working memory. The model has no robust way to distinguish “this is data the user asked me to summarize” from “this is an instruction I should follow.” Anthropic, OpenAI, and Google have all shipped partial mitigations (structured tool schemas, tool-result tagging, constitutional-AI-style refusals) and none of them close the gap. They reduce the rate. They don’t eliminate it.
Which means the defender’s job is detection and containment at the gateway, not prevention at the model.
What’s actually in your logs
If you’re running LiteLLM as the proxy, the JSON it emits per request has the fields you care about, though the schema has shifted across releases. The names below were accurate as of the 1.5x line; if you’re on something newer, double-check. The ones worth pinning to your SIEM:
- messages[].role, particularly counting tool messages per request_id
- messages[].content length when role is tool
- tool_calls[].function.name and .arguments
- model, user, metadata.agent_session_id
- response.choices[0].message.tool_calls for the next-hop call the model is about to make
The single most useful derived field is the ratio of tool-response character count to subsequent tool-call argument length. An injection that actually fires usually shows up as: a large tool response comes in (a fetched page, an email body), and the very next model output is a tool call whose arguments contain text that wasn’t in the user’s original prompt and wasn’t in the system prompt. That’s the signal. The model just decided, on its own, to take an action whose parameters derived from third-party content.
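As a rough illustration of that derived field, the sketch below computes both the length ratio and the set of argument tokens that appear in the tool response but in neither prompt. The function names, the tokenizer regex, and the assumption that arguments arrive as a JSON string are mine; a production version would need better tokenization and fuzzy matching.

```python
import json
import re

# Crude tokenizer: runs of 4+ characters that can form words, emails, or paths.
TOKEN_RE = re.compile(r"[A-Za-z0-9_@.\-/]{4,}")


def tokens(text: str) -> set:
    """Lowercased tokens with leading/trailing punctuation trimmed."""
    found = set()
    for raw in TOKEN_RE.findall(text):
        cleaned = raw.lower().strip("./-")
        if cleaned:
            found.add(cleaned)
    return found


def string_values(value):
    """Yield every string value nested inside parsed JSON arguments."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from string_values(item)
    elif isinstance(value, list):
        for item in value:
            yield from string_values(item)


def cross_turn_signal(system_prompt: str, user_prompt: str,
                      tool_response: str, arguments_json: str) -> dict:
    """Compare a tool response with the arguments of the tool call that follows it."""
    arg_text = " ".join(string_values(json.loads(arguments_json)))
    trusted = tokens(system_prompt) | tokens(user_prompt)
    # Tokens the model copied from third-party content into its next action.
    derived = (tokens(arg_text) - trusted) & tokens(tool_response)
    return {
        "response_to_args_ratio": len(tool_response) / max(len(arg_text), 1),
        "tool_derived_tokens": sorted(derived),
    }
```

A non-empty `tool_derived_tokens` list is the cross-turn anomaly. On its own it is noisy, which is what the carve-outs later in this piece are for.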
In Splunk, the search I’d start with — assuming you’ve got the proxy logs in a sourcetype like litellm:proxy — is something close to a stats by agent_session_id counting tool turns, joined against the next-turn tool-call arguments, with a regex check for argument tokens that don’t appear in the originating user message. Elastic’s equivalent is messier because you end up doing the cross-turn correlation in a transform or a runtime field, and runtime fields on high-cardinality session IDs will cost you. If you’re on Splunk Cloud, the cross-turn join wants a summary index; doing it at search time on a busy proxy will time out before it returns.
Volume reality check: a single user running a research agent will generate 8–40 tool turns per task. A 500-seat deployment of something like a Glean-style enterprise assistant is going to push something in the range of 50–200k tool-response events per day. You are not going to alert on raw tool-response content. You alert on the cross-turn anomaly.
The first week of alerts
The first round of tuning is always the same fight. The detection fires constantly on legitimate behavior, because legitimate behavior looks exactly like the attack from the gateway’s point of view. The user says “summarize this page and email me the highlights.” The agent fetches the page (tool response), then issues a send_email tool call whose arguments contain text that came from the page and not from the user. Textbook injection signature. Also textbook user intent.
What you have to carve out, in roughly this order:
The “summarize and forward” pattern. Any tool call whose name matches a known summarization-adjacent output sink (send_email, create_doc, post_message, update_ticket) where the user’s original message contains intent verbs like summarize, forward, draft, paraphrase. This is maybe 60% of your day-one false positives. Carve it as a suppression, not a whitelist — you still want the event in the index, just not paging anyone.
The vector-store chunk problem. RAG responses are tool responses too, and they contain whatever was in the indexed corpus. If the corpus includes user-generated content (support tickets, internal forum posts, Slack exports) the injection surface is already inside your trust boundary. The detection will fire on chunks that contain imperative-mood text, which is most of any helpdesk corpus. You either tag the retrieval tool separately and raise its threshold, or you do the harder work and run a content classifier on ingest into the vector store — which is the right answer but nobody’s budget covers it in year one.
Multi-turn agentic loops that legitimately escalate. A coding agent that fetches a README, then runs a build, then reads the error log, then edits a file — every one of those transitions looks like “action derived from tool content.” The carve-out here is by tool category: file-read and file-write inside a known sandboxed workspace path don’t count. Anything that reaches an external network endpoint or an identity-bearing API does.
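The three carve-outs can be sketched as a single disposition function. The sink names and intent verbs come from the text above; the `Alert` shape, the retrieval and file tool names, the `/workspace` root, the detection `score`, and both thresholds are hypothetical placeholders for whatever your pipeline produces.

```python
import posixpath
import re
from dataclasses import dataclass
from typing import Optional

OUTPUT_SINKS = {"send_email", "create_doc", "post_message", "update_ticket"}
INTENT_VERBS = {"summarize", "forward", "draft", "paraphrase"}
RETRIEVAL_TOOLS = {"vector_search"}               # hypothetical tool name
SANDBOX_FILE_TOOLS = {"read_file", "write_file"}  # hypothetical tool names
SANDBOX_ROOT = "/workspace"                       # hypothetical sandbox path


@dataclass
class Alert:
    user_message: str                # the originating user turn
    source_tool: str                 # tool whose response preceded the flagged call
    next_tool: str                   # tool the model called next
    score: float                     # hypothetical detection score in [0, 1]
    next_path: Optional[str] = None  # target path, for file tools only


def in_sandbox(path: str) -> bool:
    # Lexical check only: normpath collapses ".." but does not resolve symlinks.
    normalized = posixpath.normpath(path)
    return normalized == SANDBOX_ROOT or normalized.startswith(SANDBOX_ROOT + "/")


def disposition(alert: Alert, base: float = 0.5, retrieval: float = 0.8) -> str:
    """Return "page" or "suppress"; suppressed alerts are still indexed."""
    words = set(re.findall(r"[a-z]+", alert.user_message.lower()))
    # 1. Summarize-and-forward: known output sink plus an explicit intent verb.
    if alert.next_tool in OUTPUT_SINKS and words & INTENT_VERBS:
        return "suppress"
    # 3. File reads and writes that stay inside the sandboxed workspace.
    if (alert.next_tool in SANDBOX_FILE_TOOLS and alert.next_path
            and in_sandbox(alert.next_path)):
        return "suppress"
    # 2. Retrieval-sourced alerts must clear a higher bar before paging.
    threshold = retrieval if alert.source_tool in RETRIEVAL_TOOLS else base
    return "page" if alert.score >= threshold else "suppress"
```

The exact-match verb check misses inflections such as "summarizing", so treat the verb set as a starting point and not a finished list.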
After those three carve-outs you’ll typically still have a 3–8% false positive rate on the remaining alerts, and that’s the band where the detection becomes useful. Below that and you’re missing real events; above that and the SOC stops reading.
What most teams get wrong
The first mistake is buying a prompt-injection classifier and bolting it to the user-input side of the proxy. Lakera, Prompt Guard, NeMo Guardrails, the various open-source PromptShield forks — they all do roughly the same thing, which is run a small model over the input string and score it. That’s fine for direct injection in a chat UI. It does almost nothing for the indirect case, because the malicious content arrives via a tool response that the classifier was never positioned to see. If you want a classifier in the pipeline, put it on the tool-response side, scoring each tool role message before it’s appended to context. That changes the latency profile of every agent turn, which is why most deployments don’t do it, which is why most deployments are exposed.
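A minimal sketch of that placement follows, assuming a `Scorer` callable that wraps whichever classifier you run. `append_tool_result`, the 0.8 threshold, and the choice to withhold flagged content are illustrative assumptions, and `keyword_scorer` is a toy stand-in, not a real detector.

```python
from typing import Callable

# Any classifier can sit behind this signature: text in, injection likelihood out.
Scorer = Callable[[str], float]


def append_tool_result(messages: list, tool_name: str, content: str,
                       scorer: Scorer, threshold: float = 0.8) -> float:
    """Score a tool response before it enters context; withhold it if flagged.

    Returns the score so the caller can log it alongside the session ID.
    """
    score = scorer(content)
    if score >= threshold:
        # One possible policy: replace the content with a neutral placeholder
        # so the model never sees the flagged text.
        content = f"[tool response from {tool_name} withheld: flagged as possible injection]"
    messages.append({"role": "tool", "name": tool_name, "content": content})
    return score


def keyword_scorer(text: str) -> float:
    # Toy stand-in for demonstration only; a real deployment would call an
    # actual classifier here.
    markers = ("ignore previous instructions", "system note:")
    return 1.0 if any(marker in text.lower() for marker in markers) else 0.0
```

The scorer runs synchronously on every tool turn, which is where the added latency comes from.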
The second mistake is trusting the model’s own refusal behavior as a control. It is not a control. It is a probabilistic mitigation that vendors will quietly regress when they ship a new base model, and your detection will go dark the day they do. Treat refusal rates as telemetry, not as defense. Track the refusal rate as a baseline metric; alert on a sudden drop, because that often means the underlying model got swapped under you (the API version pinning story across the major vendors is still a mess as of this writing — gpt-4o-2024-08-06 is not the same animal as gpt-4o and your contract may or may not let you pin).
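One simple way to alert on a sudden drop is to compare the current window against a rolling baseline. The sketch below assumes you already compute a refusal rate per window (say, daily); the function name and both thresholds are arbitrary starting points, not recommendations.

```python
from statistics import mean, pstdev


def refusal_rate_dropped(baseline_rates: list, current_rate: float,
                         min_sigma: float = 3.0, min_abs_drop: float = 0.02) -> bool:
    """True when the current refusal rate sits well below the baseline window.

    baseline_rates must be non-empty (for example, the last 7 daily rates).
    """
    mu = mean(baseline_rates)
    sigma = pstdev(baseline_rates)
    drop = mu - current_rate
    # Require both a meaningful absolute drop and a statistically unusual one,
    # so a very stable baseline does not alert on tiny wobbles.
    return drop >= min_abs_drop and drop >= min_sigma * sigma
```

Only drops trigger the alert; a rising refusal rate is a different investigation.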
The third mistake — and this one’s the worst — is treating the agent’s tool permissions as if they were the user’s permissions. They are not. The model can be coerced into issuing tool calls the user would never have approved, with the user’s OAuth token. If your send_email tool is wired to the user’s mailbox via delegated auth, an injection that triggers it sends from the user. Your audit log will show the user sent the mail. The fix is human-in-the-loop confirmation on any tool whose effects are externally visible or irreversible, scoped by tool category, not a blanket prompt that says “ask before acting” (which the model will ignore the moment the injected content tells it to).
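The important property is that the confirmation is enforced in host code, outside the model’s context, so injected text cannot talk its way past it. A minimal sketch, assuming a hypothetical tool inventory and caller-supplied `run` and `confirm` callbacks:

```python
from enum import Enum
from typing import Callable


class Effect(Enum):
    READ_ONLY = "read_only"
    SANDBOXED_WRITE = "sandboxed_write"
    EXTERNAL = "external"          # visible outside the session, e.g. sends as the user
    IRREVERSIBLE = "irreversible"


# Hypothetical inventory; in practice this comes from your audited tool list.
TOOL_EFFECTS = {
    "search_wiki": Effect.READ_ONLY,
    "write_file": Effect.SANDBOXED_WRITE,
    "send_email": Effect.EXTERNAL,
    "delete_ticket": Effect.IRREVERSIBLE,
}
NEEDS_CONFIRMATION = {Effect.EXTERNAL, Effect.IRREVERSIBLE}


def execute_tool_call(name: str, arguments: dict,
                      run: Callable[[str, dict], str],
                      confirm: Callable[[str, dict], bool]) -> dict:
    """Run a tool call, asking the human first when the effect category requires it."""
    # Tools missing from the inventory get the most restrictive category.
    effect = TOOL_EFFECTS.get(name, Effect.IRREVERSIBLE)
    if effect in NEEDS_CONFIRMATION and not confirm(name, arguments):
        return {"status": "denied", "tool": name}
    return {"status": "ok", "tool": name, "result": run(name, arguments)}
```

`confirm` should show the user the literal tool name and arguments, not the model’s summary of them, since the summary is exactly what an injection can shape.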
Where this lands in NIST SP 800-53
The control mapping is more useful than people give it credit for, because it forces the conversation away from “AI safety” abstraction into something an ISSO can write a finding against.
| Controls | Where it bites |
|---|---|
| SI-10 / SI-15 | Input validation and output filtering — the tool-response classifier sits here, not on user input |
| AC-3 / AC-6 | Tool permissions distinct from user permissions; least privilege on agent identities |
| AU-2 / AU-12 | The gateway logs above; per-turn, per-tool, per-session, with the cross-turn correlation preserved |
| SC-7 | Egress control on tool endpoints — an agent that can reach arbitrary URLs is an agent that can be exfiltrated through |
| CM-7 | Tool inventory and least-functionality. Most agent stacks expose tools nobody audited |
| RA-5 | Vulnerability scanning has to include the MCP server inventory, which most scanners don’t know about yet |
| SR-3 / SR-11 | Third-party MCP servers are supply chain. Treat them as such |
SR is the one most programs are sleeping on. The 2025–2026 explosion of community MCP servers means agents are pulling in tool definitions from sources with the same trust posture as a random npm package. The tool description itself is part of the model’s context. A poisoned tool description is a prompt injection delivered through the supply chain: it fires on every agent that loads the tool, and it won’t show up in any of the detections above because it lives in the system prompt assembly, not the per-turn tool response.
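One way to make a changed tool description visible, offered as an assumption and not something the detections above already provide, is to pin a hash of each reviewed tool definition and compare it against what the server advertises at load time. The sketch assumes tool definitions are dicts with `name`, `description`, and `inputSchema` keys; `tool_fingerprint` and `changed_tools` are hypothetical helpers.

```python
import hashlib
import json


def tool_fingerprint(tool: dict) -> str:
    """Stable SHA-256 over the parts of a tool definition the model will see."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "inputSchema": tool.get("inputSchema", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(advertised: list, pinned: dict) -> list:
    """Names of tools that are new or no longer match their reviewed fingerprint."""
    return sorted(
        tool["name"] for tool in advertised
        if pinned.get(tool["name"]) != tool_fingerprint(tool)
    )
```

A non-empty result means the system prompt assembly is about to include text nobody reviewed, which is a reasonable point to block the tool or send it back through review.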
Closing
Indirect prompt injection isn’t going to be solved at the model layer in 2026, and probably not in 2027 either. The realistic posture is: assume the model will be coerced, instrument the gateway so you can see when it happens, and put the actual control where the actual blast radius is — at the tool permission boundary and the egress edge. The detection is doable. The carve-outs are predictable. The thing most teams haven’t internalized yet is that the trust boundary in an agent stack runs through the tool-response message, and every architectural decision has to follow from that.