Ollama’s GGUF Quantizer Bleeds Heap, and the Exfil Rides Its Own Push API
Roughly 300,000 Ollama servers sit on the public internet, per Cyera’s disclosure of CVE-2026-7482, and the default install does exactly what makes that number dangerous: it binds to 0.0.0.0:11434 with no authentication. The bug is a heap out-of-bounds read in the model quantization path. The payoff for an attacker is the contents of the process heap, which on a busy inference box means user prompts, other users’ system prompts, and whatever lived in the environment block when the process started. Cyera’s writeup shows AWS access keys and CLAUDE_CODE_USE_BEDROCK pulled straight out of the leaked memory, because of course someone wired Ollama into Claude Code and the credentials rode along.
That’s the part that should bother an ISSO. This isn’t a model that says something rude. It’s pre-auth memory disclosure on an appliance most shops don’t even have in the CMDB, and the appliance barely logs.
The bug, in one breath
GGUF is a binary container for model weights. The header declares each tensor’s shape, and Ollama’s quantization engine trusts that shape. When you create a model from an uploaded file and ask it to quantize, the engine computes the element count by multiplying the declared dimensions, then hands that count to a conversion routine that reads exactly that many elements out of the source buffer. Nothing checks the declared count against the actual size of the data you supplied.
Go is memory-safe right up until someone reaches for the unsafe package, and that’s precisely where this lives. The conversion uses unsafe.Slice against an attacker-controlled length, so instead of a panic you get a clean read off the end of the buffer and into adjacent heap. The over-read data gets folded into a new model layer and written to disk as a legitimate-looking model.
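The missing check is small. Ollama itself is written in Go, so the following Python is only a minimal sketch of the validation, not the actual patch: compute the element count from the declared shape and refuse to convert when the supplied buffer cannot hold that many elements. The function name and the element-size table are hypothetical.

```python
from math import prod

# Bytes per element for the two unquantized types discussed in this article.
ELEMENT_SIZE = {"F16": 2, "F32": 4}


def checked_element_count(shape, dtype, data):
    """Return the element count implied by `shape`, or raise ValueError
    if `data` is too short to hold that many elements of `dtype`."""
    if not shape or any(dim <= 0 for dim in shape):
        raise ValueError(f"invalid tensor shape: {shape!r}")
    count = prod(shape)
    needed = count * ELEMENT_SIZE[dtype]
    if needed > len(data):
        raise ValueError(
            f"header declares {needed} bytes, buffer holds {len(data)}"
        )
    return count
```

Python integers do not overflow; in a fixed-width language the multiplication itself also needs an overflow guard before the comparison means anything.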
The elegant nasty bit is the conversion choice. Most quantization is lossy, which would corrupt whatever you leaked. But declaring the source as F16 and asking for F32 as the target is a numerically lossless, fully reversible widening: every 16-bit value maps to exactly one 32-bit value with no rounding. The output isn’t byte-for-byte identical on disk (the width doubles, and F16 and F32 encode their exponents differently, so the bit patterns change), but an attacker can narrow each F32 value back to the original 16-bit word, which is all they need to recover the leaked bytes. The root cause is textbook SI-10 input validation: metadata describing a buffer was treated as authoritative about the buffer’s size. CWE-125 wearing a machine-learning hat.
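The reversibility claim can be checked at the bit level. The sketch below is an illustration in Python, not Ollama’s conversion routine: it widens an IEEE 754 half-precision pattern to single precision using integer arithmetic, narrows it back, and rebuilds the original bytes. All function names are hypothetical. NaN patterns are treated as plain bit fields here; how a particular runtime’s float conversion handles NaN payloads is outside what the source describes.

```python
import struct


def widen_f16_bits(h):
    """Map a 16-bit half-precision pattern to the 32-bit
    single-precision pattern for the same value."""
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    man = h & 0x3FF
    if exp == 0x1F:                      # infinity or NaN
        exp32, man32 = 0xFF, man << 13
    elif exp != 0:                       # normal number: rebias 15 -> 127
        exp32, man32 = exp + 112, man << 13
    elif man == 0:                       # signed zero
        exp32, man32 = 0, 0
    else:                                # F16 subnormal becomes an F32 normal
        top = man.bit_length() - 1       # position of the leading 1 bit
        exp32 = top + 103
        man32 = (man << (23 - top)) & 0x7FFFFF
    return (sign << 31) | (exp32 << 23) | man32


def narrow_f32_bits(f):
    """Inverse of widen_f16_bits for patterns that function produced."""
    sign = (f >> 31) & 0x1
    exp32 = (f >> 23) & 0xFF
    man32 = f & 0x7FFFFF
    if exp32 == 0xFF:                    # infinity or NaN
        exp, man = 0x1F, man32 >> 13
    elif exp32 >= 113:                   # fits an F16 normal
        exp, man = exp32 - 112, man32 >> 13
    elif exp32 == 0:                     # signed zero
        exp, man = 0, 0
    else:                                # back to an F16 subnormal
        exp, man = 0, (0x800000 | man32) >> (126 - exp32)
    return (sign << 15) | (exp << 10) | man


def recover_bytes(f32_blob):
    """Rebuild the original little-endian 16-bit words from F32 output."""
    words = (narrow_f32_bits(f) for (f,) in struct.iter_unpack("<I", f32_blob))
    return b"".join(struct.pack("<H", w) for w in words)
```

Because the round trip holds for every one of the 65,536 possible 16-bit patterns, arbitrary heap bytes survive it regardless of whether they look like sensible floating-point numbers.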
The exfil path is also the detection gift
Here’s what makes this worth a SOC lead’s attention rather than just a patch-and-move-on. Reading heap into a file on the server is useless to a remote attacker without a way home. Ollama provides one. The /api/push endpoint will upload a model to a registry, and it decides where to send it by parsing the model’s name. If the name looks like a URL, Ollama pushes there. Nothing stops you from naming a model http://somewhere-else/ns/model:tag and having the daemon ship the leaked layer straight out.
So the full sequence is: upload a blob to /api/blobs/sha256:..., call /api/create with quantize set to trigger the over-read, then /api/push to drag the result off-box. Three calls. No credentials.
For a defender that sequence is a fingerprint, and the egress leg is the strongest signal you’ll get. A pure-inference Ollama box pulls models and serves /api/generate. It does not, in normal operation, push models out to random hosts. An outbound model push to anything that isn’t your registry is close to a smoking gun.
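If you do have per-request records for the daemon, the three-call chain can be matched as an ordered sequence per client. This is a minimal sketch that assumes a hypothetical record layout of (client_ip, method, path) tuples in time order, and assumes all three calls arrive as POST requests; adapt both to whatever your proxy actually emits.

```python
# Ordered path prefixes that make up the three-call chain described above.
CHAIN = ("/api/blobs/", "/api/create", "/api/push")


def clients_completing_chain(records):
    """Return the set of client IPs whose requests hit CHAIN in order.

    `records` is an iterable of (client_ip, method, path) tuples in
    time order; this layout is hypothetical.
    """
    progress = {}          # client_ip -> index of the next CHAIN step expected
    flagged = set()
    for client_ip, method, path in records:
        if method != "POST":
            continue
        step = progress.get(client_ip, 0)
        if step < len(CHAIN) and path.startswith(CHAIN[step]):
            step += 1
            progress[client_ip] = step
            if step == len(CHAIN):
                flagged.add(client_ip)
    return flagged
```

A model-building team will complete the same chain legitimately, so treat a hit as context for the egress signal rather than as an alert on its own.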
The logging gap you have to fix before you can detect anything
The first thing most teams discover when they go looking is that there’s nothing to look at. Ollama’s own logging is thin, oriented at operational debug output, not security-grade HTTP access records. You will not find a tidy access log mapping client IP to /api/create calls the way nginx would hand it to you. If the daemon is a docker run one-liner some engineer fired off in week three of an LLM proof of concept, your coverage on it is zero. That’s the real AU problem here: the events you’d want to alert on mostly aren’t being generated.
So before detection, instrumentation. Two practical options, and I’d reach for the network layer first.
Put the daemon behind a reverse proxy you already trust for logging. nginx or Envoy in front of 11434, request logging on, and now /api/create and /api/push show up as $request_uri with method and source. The catch: the malicious model name lives in the JSON body of /api/create, not in the URI, so URI-only access logs see the endpoint but not the http://attacker giveaway. You’d need body capture to get that, and body logging on an inference proxy is its own retention headache (prompts are sensitive, and now you’re storing them in the proxy log too). Pick your poison.
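One way to soften the body-logging trade-off is to parse the body at the proxy and log only the model name, never the prompt. A rough sketch follows; it assumes the name travels in a top-level "model" or "name" JSON field, which you should verify against the API version you run, and the function name is hypothetical.

```python
import json


def model_name_for_log(path, body):
    """Return (model_name, looks_like_url) for create/push requests,
    or None for anything else, so prompt text never reaches the log.

    Assumption: the name is carried in a top-level "model" or "name"
    JSON field.
    """
    if path not in ("/api/create", "/api/push"):
        return None
    try:
        doc = json.loads(body)
    except ValueError:                   # covers malformed JSON and bad UTF-8
        return None
    if not isinstance(doc, dict):
        return None
    name = doc.get("model") or doc.get("name")
    if not isinstance(name, str):
        return None
    looks_like_url = name.lower().startswith(("http://", "https://"))
    return name, looks_like_url
```

Inference endpoints return None here by design, so the retention problem stays confined to model names.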
The cleaner signal is east-west and egress. Zeek on the segment, or your existing egress proxy, watching what the inference host sends. A model push to a registry speaks a Docker-registry-style protocol, so you’re looking for outbound PUT/POST to /v2/.../blobs/... paths headed somewhere that isn’t registry.ollama.ai or your internal registry. In Zeek’s http.log that’s method, host, and uri. The query shape, Splunk if you have it:
index=zeek sourcetype=zeek:http
id.orig_h IN (ollama_hosts)
uri="*/v2/*/blobs/*"
NOT host IN ("registry.ollama.ai","your-internal-registry.corp")
| stats count min(_time) max(_time) values(host) by id.orig_h
The Elastic equivalent works the same way against http.request.method and url.path; the ingest pipeline is just messier if you haven’t already normalized Zeek fields.
One hard caveat on this whole approach: it assumes the push is visible in cleartext, or that you’re terminating TLS at an egress proxy. If the daemon pushes over HTTPS straight out — the likely real-world case — Zeek’s http.log won’t see the /v2/.../blobs/... path at all. You’re down to ssl.log: SNI, JA3, and the destination address. So fall back to matching the destination — SNI or resolved host against your registry allowlist — rather than the URI path, unless policy lets you intercept. The signal survives the encryption; the path-based signature does not.
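The destination-based fallback can be sketched against Zeek’s ssl.log in JSON form. The field names id.orig_h, id.resp_h, and server_name are Zeek’s; the host set and allowlist values below are placeholders, and the function name is hypothetical.

```python
import json

# Placeholder values: replace with your own inference hosts and registries.
OLLAMA_HOSTS = {"10.20.0.15"}
REGISTRY_ALLOWLIST = {"registry.ollama.ai", "your-internal-registry.corp"}


def unexpected_tls_destinations(lines):
    """Yield (orig_host, destination) for TLS sessions from inference
    hosts to anything outside the registry allowlist."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec.get("id.orig_h") not in OLLAMA_HOSTS:
            continue
        sni = (rec.get("server_name") or "").lower().rstrip(".")
        if sni in REGISTRY_ALLOWLIST:
            continue
        # A session with no SNI falls back to the destination address.
        yield rec.get("id.orig_h"), sni or rec.get("id.resp_h")
```

This matches every TLS session from the host, not only pushes, so the first pass will likely surface other outbound traffic that needs its own allowlist entries.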
What the first round of tuning actually fixes
The false positives come from one place: teams that legitimately quantize and publish models. An ML platform group building custom fine-tunes will call /api/create with quantize all day and push the results to an internal registry. That’s not an attack, that’s their job. If you alert on /api/create-with-quantize globally, you’ll bury the SOC under their CI in the first hour.
The tuning that matters is the destination allowlist, not the endpoint. Quantization is normal. Pushing to a host nobody recognizes is not. Build the allowlist from your actual registries (Harbor, ECR, the public Ollama registry if you use it) and alert only on push destinations outside it. In a shop where Ollama is strictly inference, push volume to anywhere should be effectively zero, so any hit is high-signal and you can page on it. In a shop with a model-building team, expect a handful of legitimate internal-registry pushes per build and tune the allowlist until the external-destination alert sits near zero on a normal week.
One correlation gotcha. If you try to tie an inbound /api/create to the outbound push in a tight time window, watch the clocks. Inference boxes, especially GPU instances spun up ad hoc, drift, and a daemon that came up without NTP will skew your join. Loosen the window or correlate on the host and destination rather than precise sequencing.
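A loose join along those lines might look like the sketch below. The event layout, the function name, and the 15-minute default window are all assumptions for illustration; the point is that the comparison is absolute, so a push that appears to precede its create because of drift still pairs up.

```python
def correlate(creates, pushes, window_s=900):
    """Pair inbound creates with outbound pushes from the same host.

    creates: iterable of (host, epoch_seconds, client_ip)
    pushes:  iterable of (host, epoch_seconds, destination)
    Both layouts are hypothetical. Ordering is deliberately ignored so
    that clock skew between sensors does not break the match.
    """
    creates = list(creates)
    pairs = []
    for host, t_push, destination in pushes:
        for c_host, t_create, client_ip in creates:
            if c_host == host and abs(t_push - t_create) <= window_s:
                pairs.append((host, client_ip, destination))
    return pairs
```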
Which assumptions change the answer
The 0.0.0.0 default is the whole ballgame, and Docker makes it worse. People run -p 11434:11434 to reach the API from their laptop and quietly publish it to every interface the host has. If your Ollama is bound to 127.0.0.1 and fronted by an authenticating gateway, the unauthenticated-internet exposure largely evaporates and you’re left with insider and SSRF reach, which is a real but much smaller blast radius.
Kubernetes shifts it again. An Ollama Deployment behind a ClusterIP with no ingress isn’t internet-reachable, so the threat is lateral movement from a compromised pod rather than a scanner on the open internet — which is still viable for anyone who already has cluster access, so reach for NetworkPolicies to restrict pod-to-service traffic. An EKS setup that fronts it with an internet-facing LoadBalancer because someone wanted a demo URL is back to the exposed case. Check the service type before you decide how loud to be.
And the GPU-instance angle is worth a beat: those hosts are expensive, frequently outside the standard golden image, and provisioned by data scientists rather than platform engineers. Agent coverage on them is the gap. Don’t assume your EDR is reporting from the box that’s actually running inference.
Remediate, then go find the ones you don’t know about
Patch path first: the fix landed in Ollama 0.17.1, so upgrade to that release or later (SI-2). If you can’t patch immediately, the interim is least functionality. Restrict or disable the model upload and create surface, bind to localhost via OLLAMA_HOST, and put a boundary control in front so 11434 is never directly exposed to an untrusted segment. That’s CM-7 and SC-7 doing the work the application won’t do for you.
Note the disclosure mess while you’re tracking it, because it makes your vuln-management feed unreliable. This bug carries two CVE IDs. Echo, acting as a CNA, assigned CVE-2026-7482 at CVSS 9.1, and CERT/CC issued VU#518910 under CVE-2026-5757 after it couldn’t reach the vendor. Same flaw, two records. If your scanner keys on one ID and your ticketing keys on the other, you can close one and leave the other open looking like an unrelated finding. Map them to a single item by hand.
The step zero nobody likes: you can’t patch what you can’t see. Sweep for 11434, banner-grab for the “Ollama is running” response, and reconcile against the CMDB (RA-5, CM-8). runZero, nmap, or a Shodan query against your own ranges all get you there. Expect to find instances no ticket ever authorized, because shadow AI infrastructure is the normal state right now, not the exception.
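If none of those tools is at hand, the banner check itself is a few lines of standard-library Python. This is a rough sketch for ranges you own, assuming the daemon answers a plain HTTP GET on its root path with the banner text; the function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

BANNER = "Ollama is running"


def is_ollama_banner(status, body):
    """True if an HTTP response looks like the Ollama root banner."""
    return status == 200 and BANNER in body


def probe(host, port=11434, timeout=2.0):
    """Return True if http://host:port/ answers with the Ollama banner."""
    url = f"http://{host}:{port}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(256).decode("utf-8", errors="replace")
            return is_ollama_banner(resp.status, body)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Closed port, timeout, or a non-HTTP service: not an Ollama hit.
        return False
```

Feed it the addresses from your own allocation and diff the hits against the CMDB export; anything that answers and has no record is the finding.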
The control families line up cleanly. SI-10 is the root cause, SI-2 is the patch, SI-4 plus AU-2 and AU-12 are the detection you have to build — the event selection and generation the app won’t emit on its own — with AU-6 covering the review once the events exist. SC-7 and CM-7 keep the API off untrusted networks. AC-4 is the information-flow failure the push-to-arbitrary-URI represents, and it’s the one a future version of this bug class will keep reintroducing as long as a model name doubles as a destination.
The heap read is the vulnerability. The unlogged, internet-facing daemon nobody inventoried is the incident.
Sources
- Bleeding Llama: critical unauthenticated memory leak in Ollama (Cyera)
- VU#518910 – Ollama GGUF quantization remote memory leak (CERT/CC)
- CVE-2026-5757 (CVE.org)
- CVE-2026-7482 Ollama vulnerability (Echo)
- Ollama out-of-bounds read vulnerability allows remote process memory leak (The Hacker News)
- Ollama vulnerability CVE-2026-7482: find impacted assets (runZero)
- Ollama heap out-of-bounds read vulnerability leads to remote process memory leak (CVE-2026-7482) (Qualys ThreatPROTECT)