GitHub Logs the Workflow Run, Not the Cache Write. That’s the Gap Cache Poisoning Lives In

A pull request opened a bundle-size check. Six minutes later, 84 malicious versions across 42 @tanstack/* packages were live on npm, signed with the project’s own publishing identity. No maintainer pushed code. No release branch moved. The thing that actually carried the payload across the trust boundary was a cache entry — a pnpm-store tarball written by a fork’s workflow and later restored, unquestioned, by a privileged release job running on main. That’s CVE-2026-45321 — CVSS 9.6 on GitHub’s own CNA score, with NVD yet to publish an independent rating — in CISA’s KEV catalog with a June 10 mitigation due date. And the reason it’s worth your time isn’t the npm fallout. It’s that the load-bearing event — the cache save — produces close to nothing in the telemetry you’d normally reach for.

If you spent the spring auditing your OIDC trust policies, good. This is the adjacent failure you probably didn’t audit, because it doesn’t look like an identity problem until the very last step.

What the cache scope actually is

GitHub Actions cache is repository-scoped with a branch-derived read hierarchy, and that sentence is the whole vulnerability. A cache entry written by a workflow running against main is readable by feature branches and PRs derived from main. The intent is benign: your PR shouldn’t have to rebuild node_modules from scratch when main already cached it. Reads never flow the other way, since main cannot restore an entry scoped to a PR. What matters here is which scope a write lands in: a workflow that executes in the base-branch context writes to the base scope, and everything downstream of main, including main’s own release jobs, can read it.

pull_request_target is the trigger that puts fork-controlled inputs into that base context. Unlike pull_request, which runs the fork’s code in a sandboxed scope with a read-only token, pull_request_target runs in the base branch’s context with the base branch’s permissions. By itself that’s fine — it exists so that workflows can label PRs, post bundle-size comments, that sort of thing, with write access. It becomes a Pwn Request the moment the workflow also checks out the fork’s head:

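# the shape of the problem, not a working exploit
on: pull_request_target   # runs in the base branch's context
jobs:
  bundle-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # untrusted: the fork's code, checked out into a privileged job
          ref: refs/pull/${{ github.event.pull_request.number }}/merge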

Now untrusted code runs with base-context credentials, and the attacker doesn’t need to do anything loud with them. The cache step in that job authenticates with the cache service’s own credentials, ACTIONS_CACHE_URL and ACTIONS_RUNTIME_TOKEN, which actions/cache (and any code that loads @actions/cache) uses to talk to the cache service. That access is independent of GITHUB_TOKEN. Setting permissions: contents: read on the workflow does nothing to stop it: that setting constrains what GITHUB_TOKEN can do to the repo, but the cache service authenticates off ACTIONS_RUNTIME_TOKEN, not the workflow token. A malicious step’s cache write goes through no matter how tightly you’ve scoped the workflow token.

From there the chain is short. A malicious install script in the fork’s code (anything the workflow runs after checkout) populates the store with a stealer payload, and the job’s cache step saves it under the exact key a downstream job will request: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}. Because the job ran in base scope, the entry lands in the base scope. Every subsequent run that matches the key restores it.

And here’s the part people miss on the first read: you do not need an exact key match. restore-keys exists to give you fallback. A release workflow keyed on Linux-pnpm-store-<hash> with a restore-keys: Linux-pnpm-store- fallback will happily accept a poisoned entry under a looser prefix if the exact hash isn’t present. So the attacker doesn’t have to predict your lockfile hash. They poison the prefix and wait. The one caveat is that restore-keys returns the most recent prefix match, so the poisoned entry has to be the newest under that prefix when the release runs, which for a rarely-changing prefix it usually will be.

In the TanStack chain, the publish job restored that store and executed the attacker’s code the moment it ran install-and-build against it. The final move was minting an npm publishing token from the job’s OIDC credentials, extracted from the Actions runner at runtime. (The ACTIONS_ID_TOKEN_REQUEST_TOKEN/ACTIONS_ID_TOKEN_REQUEST_URL pair is exposed to job steps only when the job is granted id-token: write; absent that, the documented TanStack path pulled the token out of the runner process memory.) At that point the malware published under a trusted identity.

The payload was Mini Shai-Hulud, the self-propagating npm worm run by the crew tracked as TeamPCP (Google Threat Intelligence calls them UNC6780), and the May 11 wave hit 170-plus npm and PyPI packages, not just TanStack. It went after credentials wherever the poisoned packages executed (AWS, GCP, Azure, Vault, npm, GitHub, and SSH secrets, plus .npmrc files) and used stolen npm tokens to republish itself into other packages the victims maintained, which is the part that makes a worm a worm. Per the GitHub advisory, it exfiltrated what it collected over Session, the encrypted messenger.
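To make the restore-keys fallback concrete, here is a minimal Python sketch of the lookup order: exact key first, then each restore-keys prefix in order, newest match wins. It is a simplified model of a single scope, not GitHub’s implementation, and CacheEntry, resolve_restore, and the hash fragments are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    key: str
    created_at: str  # ISO 8601 UTC, so string order equals time order


def resolve_restore(entries, key, restore_keys) -> Optional[CacheEntry]:
    """Simplified model of a cache lookup within one scope.

    An exact match on `key` wins. Otherwise each restore key is tried in
    order as a prefix, and the newest matching entry is returned.
    """
    exact = [e for e in entries if e.key == key]
    if exact:
        return max(exact, key=lambda e: e.created_at)
    for prefix in restore_keys:
        matches = [e for e in entries if e.key.startswith(prefix)]
        if matches:
            return max(matches, key=lambda e: e.created_at)
    return None


# Hypothetical entries in the base scope. The hash fragments are placeholders.
entries = [
    CacheEntry("Linux-pnpm-store-aaa111", "2026-05-01T09:00:00Z"),
    CacheEntry("Linux-pnpm-store-evil00", "2026-05-11T10:00:00Z"),  # planted
]

# The release job's lockfile hash (ccc333) has no entry yet, so the
# prefix fallback decides what gets restored.
hit = resolve_restore(entries, "Linux-pnpm-store-ccc333", ["Linux-pnpm-store-"])
print(hit.key)  # Linux-pnpm-store-evil00
```

The release job asked for ccc333 and got the planted entry, purely because it was the newest thing under the prefix.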

Why your audit log is quiet

Here’s where defenders get a nasty surprise. Cache save and restore operations are not in the GitHub audit log. The audit stream (Enterprise tier) will happily tell you a workflows.completed_workflow_run happened, who triggered it, which workflow file ran. It will not tell you that the job wrote a cache entry, what key it used, or how big it was. The cache service is a separate data plane and it does not feed the audit log on any tier I’m aware of as of mid-2026. So the single most important forensic event in this attack chain — the poisoned write — leaves no audit record at all.

There’s one ephemeral exception worth naming, so you’re not blindsided when someone raises it: the actions/cache step does print Cache saved with key: ... (and Cache restored from key: ...) into the job’s run log, and you can scrape those via the Actions logs API. But a per-run log scrape gated by your retention window is a forensic aid for reconstructing an incident you already suspect — not a security event stream you can alert off. Once the run’s logs age out, even that’s gone.

What you do have is the Actions cache management API. It’s pull-based, you have to ask for it, and it’s the only durable witness:

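# gh fills {owner}/{repo} from the current repository; substitute explicit values to query another one
gh api --paginate 'repos/{owner}/{repo}/actions/caches?per_page=100' \
  --jq '.actions_caches[] | {key, ref, size_in_bytes, version, created_at, last_accessed_at}'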

That returns each cache entry’s key, the ref it’s scoped to, its size, a version (a hash the cache client derives from the cached paths and compression settings, not from the contents), a creation timestamp, and, usefully, a last_accessed_at that tells you when the entry was last restored, which is your closest proxy for “a release job just consumed this.” This is your artifact. Not a stream: a snapshot you have to diff against itself over time. Which means the detection isn’t a SIEM rule firing in real time off a log line. It’s a scheduled job that enumerates caches, normalizes the rows, and flags anomalies.

Here the scope mechanic that made the attack work also blunts the obvious heuristic. Because the pull_request_target job wrote in base context, the poisoned entry shows up under refs/heads/main, the same ref a legitimate main-branch build would use, so you can’t pick it out by ref alone. What you can catch is drift on a key: a size_in_bytes that jumped well outside the baseline for that key prefix (the 2.3 MB obfuscated stealer binary in the TanStack payload was not a rounding error against a normal pnpm store delta), or a fresh entry (a new created_at, possibly under a new version) for an existing key with no push to main behind it to explain it. A last_accessed_at that ticks forward when a release job runs is your confirmation the poisoned entry was actually consumed. None of these is as clean as a single log line, which is exactly the gap.

Pipe that snapshot into Splunk via a scripted input on a cron, or the Elastic equivalent if that’s your stack. Either works because the volume is tiny: dozens of cache rows per active repo, not a firehose. (The caches endpoint returns at most 100 entries per page, hence --paginate above, and your job has to persist each snapshot to diff size_in_bytes and last_accessed_at against the prior run; the detection is stateful, not a single call.) The hard part was never ingest. It’s that you have to go get the data, and most shops have no job that does.
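A minimal sketch of that stateful diff in Python, assuming you have already saved two snapshots as lists of rows carrying the fields from the --jq projection above. diff_snapshots, key_prefix, the finding type names, and the 1 MB tolerance are all assumptions to tune, not anything GitHub defines. It also doesn’t correlate new entries with pushes to main; that join needs data from outside the cache API.

```python
from collections import defaultdict
from statistics import median


def key_prefix(key):
    # Assumption: the last dash-separated segment is the lockfile hash.
    return key.rsplit("-", 1)[0]


def diff_snapshots(prev, curr, base_ref="refs/heads/main", tolerance_bytes=1_000_000):
    """Compare two cache snapshots and return findings for base-scope entries.

    Timestamps are assumed to be same-format ISO 8601 UTC strings, so plain
    string comparison matches chronological order.
    """
    prev_by_id = {(r["ref"], r["key"], r["version"]): r for r in prev}

    # Baseline sizes per (ref, key prefix), taken from the previous snapshot.
    sizes = defaultdict(list)
    for r in prev:
        sizes[(r["ref"], key_prefix(r["key"]))].append(r["size_in_bytes"])

    findings = []
    for row in curr:
        if row["ref"] != base_ref:
            continue
        old = prev_by_id.get((row["ref"], row["key"], row["version"]))
        if old is None:
            findings.append({"type": "new_entry", "key": row["key"]})
            baseline = sizes.get((row["ref"], key_prefix(row["key"])))
            if baseline and abs(row["size_in_bytes"] - median(baseline)) > tolerance_bytes:
                findings.append({"type": "size_drift", "key": row["key"]})
        elif row["last_accessed_at"] > old["last_accessed_at"]:
            findings.append({"type": "restored", "key": row["key"]})
    return findings


def row(key, size, created, accessed, ref="refs/heads/main", version="v1"):
    """Build one hypothetical snapshot row for the demo below."""
    return {"key": key, "ref": ref, "version": version, "size_in_bytes": size,
            "created_at": created, "last_accessed_at": accessed}


prev = [row("Linux-pnpm-store-aaa111", 300_000_000,
            "2026-05-01T09:00:00Z", "2026-05-10T09:00:00Z")]
curr = [
    row("Linux-pnpm-store-aaa111", 300_000_000,
        "2026-05-01T09:00:00Z", "2026-05-11T12:00:00Z"),
    row("Linux-pnpm-store-bbb222", 302_300_000,
        "2026-05-11T10:00:00Z", "2026-05-11T10:00:00Z"),
]

for finding in diff_snapshots(prev, curr):
    print(finding["type"], finding["key"])
```

On the made-up data, the first entry is reported as restored and the second as both a new entry and a size drift. Each finding is a small dict, so it can go to whatever index you already use as one event per line.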

The detection that floods the SOC, and the one that doesn’t

The naive rule — “alert on cross-scope cache access between a PR and main” — is a false-positive generator that will bury you, and it misreads the mechanics besides. Legitimate cross-scope reads are normal and run the opposite direction from what the rule assumes: a PR build pulls from the base branch’s cache by design, and default-branch caches are shared widely downstream. A normal pull_request cache, by contrast, is scoped to its own refs/pull/*/merge and can’t be restored by main at all — so a “PR writes, main reads” rule won’t even find this attack, while a looser ref-based rule drowns you in the benign base-to-PR reads that caching exists to provide. Ship either and you’ll turn it off inside a week. Don’t.

The signal is narrower and it’s conjunctive. You care about a cache write that satisfies all of: the writing workflow was triggered by pull_request_target, and that same workflow checked out an untrusted ref (pull_request.head.sha, refs/pull/*/merge, or head.ref), and a write occurred to base scope. The first two you get by parsing .github/workflows/*.yml — this is static config audit, and it’s where the real coverage comes from. CodeQL ships an actions/cache-poisoning/direct-cache query (“Cache Poisoning via caching of untrusted files”) that encodes exactly this pattern; run it in code scanning across your org and treat any hit as a finding to remediate, not an alert to triage. The third condition — the actual write — you can only confirm after the fact from the cache API snapshot, and by then the poisoned entry may already be sitting in scope waiting for the next release.
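For repos where code scanning isn’t enabled yet, a rough first pass over the workflow files can be sketched in a few lines of Python. This is a text heuristic under stated assumptions, not a YAML parser and not a substitute for the CodeQL query: audit_workflow and is_risky are hypothetical helpers, the regexes will both miss and over-flag, and the presence of a cache step is only a static stand-in for the third condition. The example workflow’s cache path is a placeholder.

```python
import re

# The trigger that runs in base-branch context.
TRIGGER = re.compile(r"\bpull_request_target\b")

# Expressions or refs that resolve to fork-controlled code.
UNTRUSTED_REF = re.compile(
    r"github\.event\.pull_request\.head\.(?:sha|ref)"
    r"|refs/pull/.+?/(?:merge|head)"
)

# A step that can save a cache: actions/cache, actions/cache/save, or a
# setup action's `cache:` input. actions/cache/restore is deliberately excluded.
CACHE_USE = re.compile(
    r"uses:\s*actions/cache(?:/save)?@|^\s*cache:\s*\S", re.MULTILINE
)


def strip_comments(text):
    """Drop full-line YAML comments so commented-out config is not matched."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )


def audit_workflow(text):
    """Report which of the three risk conditions appear in a workflow file."""
    body = strip_comments(text)
    return {
        "pull_request_target": bool(TRIGGER.search(body)),
        "untrusted_checkout": bool(UNTRUSTED_REF.search(body)),
        "writes_cache": bool(CACHE_USE.search(body)),
    }


def is_risky(text):
    """True only when all three conditions hold (the conjunctive signal)."""
    return all(audit_workflow(text).values())


example = """\
on: pull_request_target
jobs:
  bundle-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ github.event.pull_request.number }}/merge
      - uses: actions/cache@v4
        with:
          path: ~/.pnpm-store
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
"""

print(audit_workflow(example))
print(is_risky(example))  # True
```

Run it over every file under .github/workflows/ and treat a True as a prompt to read that workflow by hand, not as a verdict.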

So the honest framing: the config audit is your prevention and it’s high-confidence. The cache-API diff is your detection and it’s late and lossy. Treat them as different jobs with different SLAs. If you only do one, do the config audit, because the runtime detection cannot fire before the poison is already planted.

Environment assumptions that change the answer: if you run self-hosted runners, the blast radius is worse, because the cache service token and a persistent filesystem now sit on infrastructure you own and the malware can pivot off the box — into the host network, a cloud metadata endpoint (IMDS), or credentials left behind in the runner user’s home directory across jobs. GitHub-hosted ephemeral runners at least burn down after the job. If your org has the “Require approval for all outside collaborators” Actions setting flipped on, that helps for ordinary pull_request workflows — but it does not protect a vulnerable pull_request_target workflow. Those run in the base-branch context, which GitHub treats as trusted, so they fire regardless of fork-approval settings; TanStack’s own postmortem notes its pull_request_target workflows auto-ran with no first-time-contributor approval. Don’t count the approval toggle as coverage for this trigger. On GitHub Enterprise Server rather than cloud, your audit retention and API shape differ enough that you should test the cache API call against your version before assuming the --jq above returns the same fields.

Remediation, in priority order

GitHub shipped actions/checkout@v7 on June 18, 2026, which refuses to check out the head of an unreviewed fork PR inside a pull_request_target workflow. Pin to it. But pinning checkout is the floor, not the fix, because a workflow can fetch fork code without actions/checkout at all — a raw git fetch of the PR ref does the same thing and v7 never sees it.

The structural fix is to stop the privileged context and the untrusted code from sharing a cache scope. Split your cache keys by trust: prefix PR-triggered caches with pr- and release caches with release-, so an entry saved by a PR job’s own cache step can never satisfy a release job’s restore-keys. Treat that as hygiene rather than a boundary, though: code that holds ACTIONS_RUNTIME_TOKEN in base scope can save under any key it likes, release- included. Better still, don’t cache in the publish job at all. A release runs rarely and a clean pnpm install --frozen-lockfile costs you a minute or two, which is cheap insurance against restoring an attacker-controlled tarball into the one job that holds your npm token.

If you must cache there, validate the restored tree (a lockfile-pinned install run with --ignore-scripts, plus an integrity check) before any code from it executes. Lifecycle scripts are the execution vector, so suppressing them is most of the battle. And on the runner itself, a behavioral endpoint or application-control policy that blocks package managers (npm, yarn, pnpm) from spawning shells or dropper utilities during a postinstall hook is a defense-in-depth layer for the paths where scripts do run.
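One possible shape for the integrity check on a restored tree, sketched in Python. It assumes a trusted job has recorded a manifest of SHA-256 digests somewhere the cache can’t touch; that manifest and the helpers tree_digests and verify_tree are hypothetical, and the demo fakes the manifest from a temp directory.

```python
import hashlib
import os
import tempfile


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            h = hashlib.sha256()
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[rel] = h.hexdigest()
    return digests


def verify_tree(root, expected):
    """Return (unexpected, missing, modified) relative paths, each sorted."""
    actual = tree_digests(root)
    unexpected = sorted(set(actual) - set(expected))
    missing = sorted(set(expected) - set(actual))
    modified = sorted(
        p for p in set(actual) & set(expected) if actual[p] != expected[p]
    )
    return unexpected, missing, modified


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.js"), "w") as fh:
        fh.write("ok\n")
    expected = tree_digests(root)  # stand-in for a trusted manifest
    with open(os.path.join(root, "stealer.bin"), "wb") as fh:
        fh.write(b"\x00" * 16)  # simulate a planted file
    result = verify_tree(root, expected)

print(result)  # (['stealer.bin'], [], [])
```

Fail the job if any of the three lists is non-empty, and run the check before the first install or build step touches the restored directory.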

And separate the jobs entirely. The bundle-size comment job needs write permission to post a comment; it does not need to run fork code in base scope. Move the untrusted build to a pull_request-triggered job with a read-only token, have it emit the size number as an artifact, and let a separate pull_request_target job read that artifact and post the comment without ever touching fork code. It’s more YAML. It’s also the difference between a comment bot and a publishing channel.

One audience note, because everything above is written for the repo in TanStack’s seat — the target whose pipeline gets turned into a publishing channel. If you were downstream — you pulled an affected @tanstack/* version during the May 11 window — that’s a different job and an incident-response one: pin your lockfile back to a known-clean pre-compromise release, rotate any credential the install could have reached (Mini Shai-Hulud scrapes broadly), and check whether it republished into packages your own org maintains. Self-propagation is the whole point of the Shai-Hulud family, so containing it isn’t done when you’ve patched the one package you noticed.

Where this maps

The control story here (NIST SP 800-53 Rev. 5) is mostly integrity and supply chain, with an audit gap underneath it.

Control       What it covers here
SR-3, SR-4    Supply chain controls for the build pipeline; provenance of cache artifacts consumed by release jobs
SI-7          Software/information integrity; validating cache contents before execution is an SI-7(6) integrity-check problem
CM-3, CM-7    Change control on .github/workflows/; least-functionality on workflow triggers and token scope
SA-15, SA-11  Secure development process and static analysis; the CodeQL config audit lives here
AU-2, AU-12   Audit event selection. The relevant point is the deficiency: cache operations are not auditable events on the platform, so you are compensating with API polling
SC-7          Boundary protection between untrusted fork context and privileged base context

The AU row is the uncomfortable one. You can write a beautiful SI-7 control narrative for cache validation, but if your assessor asks where the audit trail for cache writes is, the honest answer is that the platform doesn’t produce one and you’re synthesizing it from a polled API on a schedule. Document that as a known limitation rather than pretending the audit log covers it. It doesn’t.

Cache poisoning isn’t a bug GitHub can fully patch, the same way the OIDC sub condition wasn’t. The cache hierarchy works as designed; the trust boundary is one you draw in your own workflow files. Draw it before the next bundle-size PR draws it for you.

Sources