Detecting MCP Rug Pulls When an Approved Tool Rewrites Its Own Description

By AutoCypher · 7 weeks ago 10 Jun 2026

The Model Context Protocol moved from “interesting demo” to “thing running on half the developer laptops in your org” faster than most security teams built any opinion about it. That speed is the problem. The trust decision in MCP happens once — at install, when a human eyeballs a tool’s description and clicks approve — and the protocol never promised that decision stays valid. A server can hand you a clean, helpful tool definition on Tuesday and a poisoned one on Friday, and the spec has no requirement that anyone be told. That’s the rug pull, and it’s not a theoretical edge case. It falls straight out of how the protocol is designed.

The mechanism is simple enough that it’s almost insulting. When an MCP client connects to a server, it calls tools/list and gets back an array of tool objects: a name, a human-readable description, and an inputSchema. The description is not documentation for the user. It is injected into the model’s context as instructions the model is expected to follow. Whatever text sits in that field becomes part of the agent’s operating prompt. So if an attacker controls the server, the description field is a direct write into your agent’s brain, gated only by a one-time approval the user has long forgotten about.

The spec gives servers a clean way to update that list mid-session. A server that declares the listChanged capability can emit notifications/tools/list_changed, after which the client re-fetches tools/list and picks up whatever’s new. Legitimately, that’s for servers whose tool set genuinely changes — a database MCP server that exposes new tools as new tables appear. Used in anger, it’s the redefinition channel. Approve the benign version, get the malicious version pushed after you’ve stopped paying attention. The MCP spec text (the 2025-06-18 revision) is explicit that tool annotations like readOnlyHint and destructiveHint are advisory and “should be considered untrusted unless obtained from trusted servers.” Read that again. The protocol’s own designers are telling you the metadata is not a security boundary.

Simon Willison flagged the prompt-injection shape of this in April 2025, building on Invariant Labs’ tool-poisoning work, and the pattern got a name on the various MCP vulnerability trackers: rug pull, or silent tool redefinition. The disclosure is old in AI-time. What hasn’t caught up is the detection story on the blue-team side, because the telemetry to catch it mostly doesn’t exist by default.

You can’t detect what the client never logs

Here’s the first wall you hit. The interesting event — a tool description changing after approval — happens inside an MCP client. Claude Desktop, Cursor, the VS Code Copilot agent, a homegrown LangGraph thing some team stood up in a sprint. None of those write the full tools/list response anywhere your SIEM can see it. The description text that got injected into the model is in process memory and then it’s gone. Your Splunk index has nothing. There is no tool_description field because nobody emitted one.

So before you write a single detection, you have a logging problem, and it’s an architecture decision, not a config flag. The only sane place to capture this is a chokepoint that every MCP session traverses: a gateway or proxy sitting between clients and servers, terminating the JSON-RPC, logging tools/list responses and notifications/tools/list_changed events, then forwarding. If your agents talk to remote MCP servers over HTTP, a gateway is straightforward to insert and you should. If your developers are running local stdio servers on their laptops — npx some-mcp-server spawned as a subprocess of Cursor — there is no network hop to intercept, no gateway, and honestly not much you can do from the SOC short of an EDR rule watching process spawns and a policy that says local MCP servers go through an approved registry. That gap is real and I’m not going to pretend a query fixes it.

Assume you’ve got the gateway. Now the telemetry exists and the detection becomes tractable.

The actual detection: hash the manifest, alert on drift

The core control is boring and it works: compute a stable hash over each tool’s identity at approval time, store it as a baseline, and compare on every subsequent tools/list. Hash the concatenation of name, description, and a canonicalized inputSchema — canonicalized because JSON key ordering will otherwise generate hash churn that looks like an attack and isn’t. SHA-256 is fine; you’re detecting change, not resisting a cryptographer.

In Splunk terms, you’re emitting one event per tool per tools/list with fields like mcp_server, tool_name, tool_desc_sha256, schema_sha256, and session_id. The detection is a lookup join against a baseline KV store:

index=mcp sourcetype=mcp:toolslist
| lookup mcp_tool_baseline mcp_server tool_name OUTPUT baseline_desc_sha256
| where isnotnull(baseline_desc_sha256) AND tool_desc_sha256!=baseline_desc_sha256

That’s the whole idea. A previously-approved (server, tool) whose description hash no longer matches the approved value. Pair it with a second, lower-confidence signal: any notifications/tools/list_changed from a server outside an allowlist of servers you actually expect to mutate their tool sets. Most servers never legitimately send that notification. The ones that do, you know about.

Volume is low, which is the good news. tools/list fires at session start and on list-changed events, so you’re talking a handful of events per agent session, not a firehose. A drift alert that fires at all is interesting almost by definition. The bad news is where the false positives come from, and they’re not where a tidy lab test would suggest.

What the first week of tuning actually fixes

The noise is not attackers. It’s your own developers and the package ecosystem.

Version bumps. A developer runs npm update, the MCP server package goes from 1.4.2 to 1.5.0, three tool descriptions got reworded by the maintainer, and your drift rule lights up for every one. Legitimate change, indistinguishable from a rug pull at the hash level. The fix isn’t to suppress it — it’s to route version-correlated drift to a re-approval queue instead of a SOC alert. If the gateway can see the server’s advertised version and the version moved, that’s a software-update workflow (a CM-3 change, if you want the control number). If the description changed and the version did not, that’s the one that should page someone. Same hash mismatch, completely different meaning, and the version delta is the field that tells them apart.

Dev servers under active construction. Anyone building an MCP server is editing tool descriptions constantly. Point your gateway at a dev environment and the drift rule becomes a status indicator for someone’s afternoon. Carve dev MCP endpoints into their own index with the rule disabled, or you’ll train the SOC to ignore the alert before a real one ever arrives. This is the single most common reason these detections get switched off in month two.

Pagination and ordering artifacts. If you hash the raw tools/list payload instead of per-tool canonical fields, cursor-based pagination and non-deterministic key ordering will generate phantom drift. This is a self-inflicted wound. Canonicalize per tool, hash per tool, and the problem disappears. (The docs on dynamic discovery are clear enough about pagination; the failure is on the detection-engineering side, not the spec.)

There’s a content-based signal worth layering on once the hash drift rule is stable: scan new or changed descriptions for the linguistic shape of injected instructions. Imperatives aimed at the model rather than the user, references to reading credential-bearing paths, instructions framed as setup steps the model must perform “before” using the tool. I’d keep this as enrichment that raises severity on a drift event, not a standalone detector. As a standalone it’s a regex arms race you lose, and it’ll flag legitimate tools whose descriptions reasonably mention files and keys. False positives there are guaranteed and the precision is bad enough that paging on it alone will burn you.

Where this maps in 800-53, and why it’s mostly SR and CM

The instinct is to file prompt injection under SI and move on. That undersells it. A rug pull is a supply-chain integrity failure dressed up as a runtime one. The tool definition is third-party content you ingested and trusted; it changed without authorization; your config baseline didn’t notice.

Control	Why it applies here
SR-3, SR-4, SR-11	The MCP server is an external component in your supply chain. Tool definitions are the artifact; provenance and integrity of that artifact is the control objective.
CM-2, CM-3, CM-6	The approved tool manifest is a configuration baseline. Drift from it is an unauthorized change. Treat re-approval as a change-control gate.
SI-7	Integrity verification of the tool definitions — the hash baseline is your SI-7 mechanism, even if nobody calls it that.
SA-9	External system services. You’re consuming a service whose behavior you don’t control; the agreement and monitoring expectations live here.
AC-6	Least privilege on what each tool’s underlying credentials and scopes can reach, so a poisoned tool’s blast radius is bounded.

The AC-6 row is the one that saves you when detection fails, and detection will sometimes fail. If the MCP server that got rug-pulled holds a token scoped to read one repo, the worst case is bounded. If it holds a broad cloud credential because someone wired it up fast, the description rewrite is just the delivery mechanism for whatever that credential can touch. Scope the tools down and the rug pull degrades from incident to annoyance.

What changes the answer

Environment assumptions move this around more than the detection logic does. A centrally managed agent platform — agents running server-side, all MCP traffic through a gateway, an approved server registry — is a world where hash-drift detection is real and enforceable. A pile of developer laptops each running local stdio servers pulled from public npm is a world where your best move is procurement and EDR, not SIEM rules, because the telemetry never leaves the host. FedRAMP and other regulated workloads should not be reaching arbitrary remote MCP servers at all; if they are, the rug pull is the second problem and egress is the first.

The forcing question for any team standing this up: when an approved MCP tool’s description changes tonight, does anything, anywhere, generate a record? If the honest answer is no, the rug pull isn’t a risk you’ve accepted. It’s one you can’t see. Start by making the change observable, then worry about catching it.

You can’t detect what the client never logs

The actual detection: hash the manifest, alert on drift

What the first week of tuning actually fixes

Where this maps in 800-53, and why it’s mostly SR and CM

What changes the answer

Sources