Watch the Image Fetch: Detecting Indirect Prompt Injection on Egress

The thing that made EchoLeak (CVE-2025-32711) worth reading about wasn’t the prompt. It was the exit. Aim Labs’ disclosure and the later arXiv writeup walk through a chain against Microsoft 365 Copilot where a single crafted email, sitting unread in a mailbox, gets pulled into the RAG context during a normal user query and coaxes the model into stuffing sensitive context into an outbound reference-style Markdown link — one that resolves through a Content-Security-Policy-approved Microsoft domain. CVSS 9.3, zero user clicks, patched server-side in June 2025. Microsoft says there’s no evidence of in-the-wild abuse. Fine. The mechanism is the lesson, and it generalizes well past Copilot.

Here is the part defenders keep getting wrong: they try to catch the injection. They buy or build a classifier that scans retrieved content for “ignore previous instructions” and its cousins, declare the prompt-injection problem handled, and move on. Microsoft had a classifier for exactly this — the XPIA (Cross-Prompt Injection Attempt) filter. The EchoLeak chain walked past it by writing the malicious instruction as ordinary prose addressed to a human, never mentioning AI, Copilot, or anything the classifier was trained to flag. That’s not a tuning failure. That’s the structural reality of trying to detect adversarial intent in open-ended natural language. You will lose that race often enough that it can’t be your only line.

The injection is the hard detection surface. The exfiltration is the easy one. So instrument the exfiltration.

The channel is the render step

Almost every practical LLM data-leak ends the same way: the model’s output causes an HTTP request that carries data to a destination the attacker controls. In a chat or copilot UI, the dominant primitive is the auto-fetched image. The model emits Markdown image syntax, the client renders the answer, the rendering layer issues a GET for the src, and whatever the attacker convinced the model to encode into that URL is now in someone’s access log. No click. The user sees a broken image icon, maybe, if they’re looking.

Reference-style Markdown is the variant that beat Copilot’s link redaction. Instead of an inline [text](url), you define the link target separately and reference it by label. Redaction logic that pattern-matched on inline links didn’t catch the split form. Small parsing gap, full bypass. (The docs and most defensive write-ups still under-describe this; if your output filter only understands inline links, it understands half the problem.)

Then there’s the CSP question, which is where EchoLeak got genuinely clever. A strict img-src allowlist should stop an image fetch to an attacker domain cold. But the allowlist included broad, trusted Microsoft origins, and one of them fronted a proxy that would relay the request onward. Allowlisting a whole trusted domain that happens to contain an open redirect or a relay is the same as not having an allowlist. The boundary was there. It leaked because it was drawn around brand, not behavior.

If you run agentic systems with real tool access, none of the rendering discussion even applies — a tool that can make an outbound HTTP request is the channel, directly, no Markdown required. The render-step exfil is the version that survives in read-only assistants. Tool-equipped agents hand the attacker a cleaner exit and you should assume the bar is lower there, not higher.

Where you can actually put a sensor

First, the uncomfortable scoping. If your exposure is M365 Copilot SaaS, you mostly can’t instrument this channel. The rendering happens in Microsoft’s surface, the telemetry is theirs, and the EchoLeak-specific fix shipped server-side. Your levers there are blast-radius levers — Purview DLP, restricting which mailboxes and sites Copilot can reach, trimming sensitivity-labeled content out of scope. Useful, but it’s risk reduction, not detection. Don’t let a vendor sell you “Copilot prompt-injection monitoring” as if you have a sensor where you don’t.

Where you can instrument is your own stuff: the homegrown RAG app, the internal support assistant, the agent fronting your knowledge base. If that app sits behind a model gateway — LiteLLM, an APIM facade in front of Azure OpenAI, whatever you’ve standardized on — you have two clean tap points, and you want both.

Tap one is the completion text, before it renders. Log the raw model output and scan it for Markdown image and link syntax pointing off-domain. This is the highest-signal, lowest-effort detection in the whole problem, because the attacker’s exit is in the text you already have. A starting regex matches inline images !\[[^\]]*\]\(\s*(https?://[^)]+)\), inline links, and crucially the reference definitions ^\s*\[[^\]]+\]:\s*(https?://\S+) on their own lines. Extract the host, compare against an allowlist of domains your assistant is actually supposed to emit. Anything else is a finding.

Tap two is egress. Outbound GETs from the rendering context to a destination not on the allowlist, ideally enriched with the request originator so you can tie it to the assistant rather than the user’s general browsing. If you’ve got Splunk, this is your proxy sourcetype (Zscaler, Netskope, or a plain egress proxy) filtered on http_method=GET and dest not in your allowlist lookup, joined where possible against the app’s identity. The Elastic equivalent works too, it’s just messier to correlate the originating process into url.original because the ingest pipeline doesn’t carry that context unless you’ve already enriched it upstream.

Use both because each covers the other’s gap. The completion scan catches the payload even when the client refuses to fetch it; the egress sensor catches the fetch even when something rewrote the output after your gateway saw it.

What the first week of tuning actually looks like

The completion-text detector will light up immediately, and most of it will be legitimate. Assistants emit images on purpose — rendered charts, inline diagrams, avatars, and especially citation links back into SharePoint or your wiki. A RAG app that cites sources is supposed to produce off-host-looking links constantly. So the first tuning pass is building the allowlist, and the allowlist is bigger than you think: your own domains, your doc store, your CDN, the chart-rendering service, the identity provider’s avatar host. Expect to spend the first round adding legitimate destinations, not chasing attackers.

The egress side has a nastier false-positive source: long, high-entropy URLs are normal. Signed S3 URLs carry long opaque signatures. SSO redirects and analytics beacons carry long tokens. If your detection logic is “long query string with base64-looking content equals exfiltration,” you’ve built an alert that fires on every pre-signed download link in the environment, and the SOC will mute it inside a week. Entropy alone is a weak signal here and I’d push back hard on any vendor rule that leans on it as the primary discriminator.

The discriminator that holds up is destination, not payload shape. “Off-allowlist destination, request originated from the assistant’s rendering context, content type is image” is a tight predicate. Layer entropy on top as a ranking signal if you want, but don’t make it load-bearing. And accept the honest limit: you usually cannot prove the query string contains exfiltrated secrets versus random bytes without decoding it, and you often can’t decode it. The alert tells you the assistant tried to talk to a place it shouldn’t. That’s enough to act on. Treat “prove the data was sensitive” as a forensics question for after the block, not a gate on the detection.

Volume, roughly: in a few-hundred-user internal assistant, the completion-text detector will produce a handful of distinct off-allowlist destinations a day during week one, collapsing to near zero once the allowlist stabilizes. When it spikes after that, it means either someone shipped a new integration (boring, the common case) or retrieved content is steering output toward a new destination (not boring). Both are worth a look. The second is the one you built this for.

The controls underneath, and which ones are brittle

The single highest-leverage control is not detection at all. It’s refusing to auto-fetch external images in the assistant’s render layer, or restricting fetches to an allowlist of origins. Kill the zero-click and you’ve demoted a silent exfil to a link the user has to deliberately click, which is a categorically smaller problem. Everything else is defense in depth around that.

Control 800-53 Where it bites in practice
Disable/allowlist external image auto-fetch in render CM-7 The fix nobody owns — it lives in the front-end, the security team doesn’t control it
Strict img-src/connect-src by behavior, not brand SC-7 Allowlisting a whole trusted domain with an open relay re-opens the hole
Treat retrieved content as untrusted; block its instructions from triggering fetches/tool calls AC-4, AC-3 Hard to enforce cleanly; this is the spotlighting/provenance research area, not a checkbox
Output filtering aware of reference-style Markdown SI-15 Inline-only link redaction is the bypass that beat Copilot
Limit the assistant’s data scope and permissions AC-6 Shrinks blast radius; doesn’t detect anything
Log model inputs and outputs AU-2, AU-12 Most shops log neither, so there’s nothing to hunt in

That last row is the quiet failure. A lot of LLM deployments log the API call metadata and nothing of the actual completion, sometimes deliberately because completions contain sensitive data and nobody wanted that in the index. Understandable. But it means when something does go sideways, the IR team has no record of what the model emitted, and the completion-text detector above has nothing to read. Decide that retention question on purpose, with a real conversation about where those logs live and who can read them, instead of discovering the gap mid-incident.

Microsoft’s own answer, documented in their July 2025 MSRC post, leans on Spotlighting — marking untrusted retrieved content so the model can tell instructions-from-the-user apart from text-it-was-asked-to-process — plus deterministic guardrails and a dedicated classifier. The marking-and-provenance approach is the most architecturally honest one in the set, and it’s also the least finished. Treat it as a research direction that’s shipping, not a solved control. The deterministic egress boundary is the part you can build today and reason about tomorrow.

Stop spending your whole budget trying to read the attacker’s mind in the prompt. Watch where the answer tries to phone home.

Sources