Tool-Call Divergence: Detecting Indirect Prompt Injection in Agentic Systems
The shift from chat-style LLMs to tool-calling agents — MCP servers, browser-using agents, RAG pipelines that actually take actions — moved prompt injection from “make the model say something embarrassing” to “make the model do something on your behalf.” The detection problem changed shape with it. For a SOC lead inheriting an agent platform in 2026, the right mental model is closer to insider-threat monitoring than content moderation. You are watching an authenticated principal take privileged actions based on instructions you cannot fully see.
That reframing is the whole post, really. The rest is what falls out of it.
How the injection actually lands
Indirect prompt injection means the malicious instruction does not come from the user. It comes from a document the agent retrieved, a webpage the browser-agent loaded, a Jira ticket it was asked to summarize, an email in the inbox the assistant has read access to, a tool-result returned as part of a chain. The model’s policy boundary between “data to process” and “instructions to follow” is, frankly, not real — every serious red team has shown this and the labs have stopped pretending otherwise. Constitutional training and system-prompt hardening reduce the rate. They do not eliminate it.
The 2026 version of this problem is shaped by MCP. Tool-poisoning — where a malicious MCP server returns tool descriptions or tool results containing hidden instructions — became a named class of attack last year. The spec patched some of the worst surface (description signing, tighter schemas) but the underlying issue is unchanged: any content the model ingests is a candidate instruction channel. Including the tool catalog itself.
What detection looks like in the SIEM
Three families of signal survive contact with real data.
Tool-call divergence from user intent. The user asked the agent to summarize a customer ticket. Six tool calls later the agent invoked send_email to an external domain. That sequence is the signal. You need the user prompt, the system prompt, the full tool-call trace, and ideally a small classifier scoring “do the tool calls plausibly serve the stated user intent.” OpenTelemetry’s GenAI semantic conventions (the gen_ai.* attributes) finally made this not-horrible to ingest in late 2024; if your platform team is still rolling their own JSON blobs in 2026, that is the first fight to pick.
In Splunk terms — assuming you’ve got the OTel collector landing spans into a gen_ai_traces index — the fields you care about most are gen_ai.tool.name correlated with gen_ai.prompt and the originating gen_ai.conversation.id. The detection is roughly “tool X called within conversation Y where the initial user prompt scored below ~0.4 on intent-match against tool X.” That classifier is the part nobody wants to own. LLM-as-judge with a small calibration set is where most teams end up, with all the cost and latency implications that implies.
Sensitive-tool invocation after untrusted-content ingestion. Simpler signal, easier to build, lower false-positive rate. Tag every tool in your registry with a sensitivity score and a consumes_external_content flag — anything that fetches URLs, reads email, queries a vector store with documents from outside the trust boundary. The rule: any high-sensitivity tool call (send_email, create_jira_ticket, anything that hits a write API) that occurs in the same conversation as a prior untrusted-content tool result. In a 200-seat shop expect this to land in the single digits per day after the first tuning pass, more during business hours when humans are actually working with the assistant.
Egress and exfiltration patterns in tool outputs. This is where most of the actual exploitation in 2025 happened, and where I would put detection effort first if I were standing this up cold.
The markdown-image exfiltration tell
The dominant exfil channel for agent attacks last year was not novel: an injected instruction tells the model to render markdown referencing an attacker-controlled image URL with sensitive data encoded into the path. The model emits the markdown. The client renders the image. The attacker’s server logs the GET. Done.
This works in any agent UI that auto-renders markdown without scrubbing — which, depressingly, includes a number of vendor consoles that shipped with the box checked. Client-layer mitigation is straightforward (CSP, image proxy, strict domain allowlist) and the well-known providers patched their own surfaces. The problem is that your shop probably wrote three internal agent UIs and at least one of them does not.
Detection signals, in priority order:
- Outbound HTTP from the agent runtime to domains not on the tool-registry allowlist. Pull this from your egress proxy logs (Zscaler, Squid, whatever you have) keyed on the agent service account. If your agents run on EKS with the AWS VPC CNI and you don’t have egress filtering on the workload SG, that’s the bigger problem and you should fix it before you bother with detection.
- Markdown image references in model outputs where the URL has base64-shaped strings, long opaque query parameters, or path segments with high entropy. The regex is ugly and your false-positive rate on legitimate CDN URLs will be loud for the first week. Tune by allowlisting known CDNs per tool — Confluence, SharePoint, your docs site — and only alerting on the long tail.
- Tool-result-to-output URL propagation. If a URL that appeared inside a retrieved document then appears in the model’s output to the user, that is either a citation (fine) or a reflected payload (not fine). The cheap discrimination is whether the URL is inside a markdown image tag versus a link tag, plus whether the domain matches a citation source the agent was told it could cite.
A note on volume. The tool-call-trace data is heavy. A single agent conversation with five tool calls and modest context windows is easily 50–200KB of span data, much of it the raw prompt and tool inputs. At 10K conversations a day you are at 1–2GB of agent telemetry before you store the model outputs. Most teams end up with 7–14 days hot in Splunk or Elastic and the rest in S3 with Athena (or the Frozen tier) for retrospective hunting. If your CISO wants 90 days hot on this, push back with numbers.
The first week of tuning
The detections above will fire on benign behavior more often than you expect. The fixes, roughly in the order they will hurt:
The “summarize this webpage and email it to me” pattern. Users legitimately ask the agent to fetch external content and then take an action with it. Your sensitive-tool-after-untrusted-content rule will alert on every one of these. Don’t suppress the rule — carve it out by user intent. The workflow you want to keep visible is the one where the agent fetched external content the user did not explicitly point at, then took an action.
Citation URLs in markdown. The image-exfil regex will hit any agent that emits inline diagrams or screenshots in answers. Allowlist the citation domains the agent is permitted to reference, alert on everything else.
Browser agents that genuinely need to navigate arbitrary URLs. If you’ve got an agent doing web research, the “outbound to non-allowlisted domain” rule is meaningless for that workload. Segment those agents onto a separate egress profile and run a different detection — focused on what they exfiltrate from your internal state, not where they browse.
LangChain/LangGraph retry loops that double-fire tool calls when an exception is raised mid-chain. You will see the same tool call land in your trace index two or three times with different span IDs and think you have a loop attack. You probably don’t — check the parent span and the exception field before paging anyone. (The docs were clearer about this in the 0.2 era; current ones may not be.)
What teams get wrong before they get it right
Treating it as a content-moderation problem. Input/output classifiers — the LLM-guardrail products — are useful as a layer, but if your only signal is “did the prompt look like an injection,” you will miss the indirect cases entirely. The injection is not in the prompt. It is in the retrieved document, and the model already ingested it before your guardrail saw the conversation.
Trusting tool descriptions. The MCP threat model assumes the tool catalog is itself trustworthy. If your agents pull descriptions dynamically from a server you do not control, those descriptions are a content channel and need the same inspection as any retrieved document. The 2026 spec added description signing — use it, and reject unsigned servers at the gateway.
Building all of this inline in the agent runtime. Putting detection inside the model-call path is tempting because the context is right there. It is also where it will get disabled the first time it adds 400ms to p95. Push detection to the trace pipeline downstream. Let the runtime emit, let the SIEM judge.
Assuming the model is the perimeter. It isn’t. The perimeter is the tool registry, the egress controls on the runtime, and the auth scoping on what each tool can do on behalf of the agent’s service account. If your send_email tool can email anyone in the world from a service account with no per-recipient policy, no detection is going to save you on the day the injection lands. The model is the worst place to enforce authorization and it should not be the only place doing so.
Where this sits in 800-53
| Concern | Controls |
|---|---|
| Untrusted content reaching the model | SI-10, SC-7, AC-4 |
| Tool-call telemetry and trace retention | AU-2, AU-12, AU-6 |
| Agent service-account scoping | AC-3, AC-6 |
| Tool registry integrity and MCP server trust | CM-5, CM-7, SR-3, SR-4 |
| Output sanitization and egress filtering | SC-7, SI-7 |
| Detection engineering on agent traces | SI-4 |
| Response procedures for compromised agent sessions | IR-4, IR-5 |
The mapping is not exotic. The novelty is at the implementation layer, not the control catalog — AC-4 has always been about information flow, it just now has to cover “instructions hiding inside a Confluence page the RAG pipeline retrieved.”
Closing
The model is not going to solve prompt injection for you. Instruction/data separation at the model layer is improving incrementally; the agentic surface area is growing much faster. Detection has to live in the trace pipeline and the egress path, not in the prompt. Treat agents as authenticated principals with scoped tools, watch the tool-call graph the way you watch process trees on endpoints, and pay particular attention to anything the agent renders back to the user. That last surface is where most of the loss happened last year, and the fix is unglamorous: sanitize the output, filter the egress, allowlist the citations.