OIDC trust policies are the CI/CD attack surface nobody is watching

By AutoCypher · 07 Jun 2026

Keyless federation was supposed to kill the long-lived cloud credential sitting in a CI secret. Mostly it did. Instead of a static AWS_SECRET_ACCESS_KEY baked into a repo secret and rotated never, your pipeline now mints a short-lived OIDC token, hands it to STS, and gets back a session that dies in an hour. Good trade. The problem is that you moved the trust decision out of a secret store you could audit and into an IAM trust policy condition block that almost nobody reviews after it’s written once and copied into forty other roles.

The tj-actions/changed-files compromise in March 2025 (CVE-2025-30066) made the stakes concrete. An attacker modified a widely-used Action so it dumped runner memory — including the temporary credentials and tokens present during the job — into the build logs. Per StepSecurity’s writeup, the action was pulled into tens of thousands of repositories. Any pipeline that had already federated into AWS and held a live session when that step ran was exposed, and the exfil channel was the build log itself, which plenty of orgs make world-readable on public repos. So the question stopped being academic: if a compromised dependency in your CI gets a foothold inside a running job, how far does your OIDC trust policy let it walk into your cloud account, and would you see it?

The mechanism, and where it actually breaks

The flow is simple. GitHub Actions exposes an OIDC token endpoint inside the job when the workflow grants the id-token: write permission. The token is a JWT signed by GitHub’s OIDC provider, and its claims describe the build context: repository, ref, workflow, environment, job_workflow_ref, and the all-important sub (subject), which encodes things like repo:my-org/my-repo:ref:refs/heads/main. Your AWS IAM role has a trust policy that says “allow sts:AssumeRoleWithWebIdentity from this OIDC provider, but only if the token claims match these conditions.”
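
If you want to see those claims for yourself, a minimal sketch like the one below decodes the payload of a JWT without verifying it. It is for inspection only; STS does the real signature and claim validation. The token here is assembled locally as a stand-in (the helper _b64url, the claim values, and the placeholder signature are all illustrative assumptions, not a real GitHub token).

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the claims from a JWT payload WITHOUT verifying the signature.

    Inspection only: never make a trust decision on unverified claims.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(obj: dict) -> str:
    """Hypothetical helper: encode a dict the way a JWT segment is encoded."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Illustrative claims only; a real token carries many more fields.
example_claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:my-org/my-repo:ref:refs/heads/main",
    "repository": "my-org/my-repo",
    "ref": "refs/heads/main",
}
example_token = ".".join(
    [_b64url({"alg": "RS256", "typ": "JWT"}), _b64url(example_claims), "sig"]
)

claims = decode_jwt_claims(example_token)
print(claims["sub"])
```

Dumping the claims from a throwaway workflow run is usually the fastest way to learn the exact sub string you need to pin.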

The condition is the whole game. Done right, it pins to a specific repo and a specific ref or environment:

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:environment:prod"
  }
}

Done wrong — and this is the common failure — it uses StringLike with a wildcard that’s far too broad. repo:my-org/* trusts every repo in the org. repo:my-org/my-repo:* trusts every branch, every PR, every environment in that one repo, including pull_request runs from forks if the workflow is configured to grant the token there. People write the wildcard during the initial setup because pinning the exact sub is fiddly and the build keeps failing with Not authorized to perform sts:AssumeRoleWithWebIdentity, so they widen the condition until it goes green and move on. The wildcard never gets tightened. (The same way every iam:PassRole with Resource: * got there. Nobody set out to do it.)
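
To make the blast radius of each pattern concrete, here is a rough sketch that approximates StringLike matching (only * and ? are treated as wildcards, matching is case-sensitive) and shows which subjects each condition would admit. The function string_like is an approximation written for this example, not AWS’s evaluator, and the subject strings are made up.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' any single one."""
    regex = "".join(
        ".*" if ch == "*" else ("." if ch == "?" else re.escape(ch))
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# Hypothetical subjects a token could present.
SUBS = [
    "repo:my-org/my-repo:environment:prod",
    "repo:my-org/my-repo:ref:refs/heads/feature-x",
    "repo:my-org/my-repo:pull_request",
    "repo:my-org/other-repo:ref:refs/heads/main",
    "repo:attacker-org/my-repo:ref:refs/heads/main",
]

# The pinned condition from the article, then the two wildcard patterns it warns about.
CONDITIONS = [
    "repo:my-org/my-repo:environment:prod",
    "repo:my-org/my-repo:*",
    "repo:my-org/*",
]


def admitted(pattern: str) -> list:
    """Return every sample subject the pattern would let through."""
    return [sub for sub in SUBS if string_like(pattern, sub)]


for condition in CONDITIONS:
    print(condition, "->", len(admitted(condition)), "of", len(SUBS), "subjects admitted")
```

A pattern with no wildcard behaves like StringEquals here, which is the point: the pinned condition admits exactly one subject and every widening admits more.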

The nastier variant is a trust policy with no aud or sub condition at all, one that checks only the provider. On GitHub’s default issuer, https://token.actions.githubusercontent.com, that provider is shared across all of GitHub, so any repository whose workflow can request a token can assume the role, not just a brand-new repo an attacker creates in your org after getting a foothold. For the same reason, if you’re on the default shared issuer rather than a per-tenant issuer such as https://token.actions.githubusercontent.com/my-org, you have to validate the org in the condition yourself. Miss that, or anchor the sub pattern so loosely that the org portion falls under a wildcard, and anyone with a repo named to match can satisfy it.

What this looks like in CloudTrail, and the gap that ruins the obvious detection

Here’s the part that trips up SOC leads who think they can just alert on the role assumption. CloudTrail records AssumeRoleWithWebIdentity as a management event, and you get requestParameters.roleArn, requestParameters.roleSessionName, sourceIPAddress, and a userIdentity block. What you do not reliably get is the full set of OIDC claims — the sub, the repo, the ref — that actually tell you which workflow on which branch assumed the role.

That’s the gap. The token claims are evaluated by STS at assume time and then mostly discarded from the log record. Some claim data shows up in additionalEventData depending on provider configuration, but you cannot count on the sub being there, and building a detection that assumes it is will quietly do nothing. (Check your own trail before you design around this — pull a known-good AssumeRoleWithWebIdentity event from your index and look at what’s actually populated. The docs imply more than you get.)
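
One way to run that check is a small script that reports which identity-bearing fields are actually populated in a record you exported. This is a sketch under assumptions: the field paths listed are ones CloudTrail can carry on this event type, but which of them are filled in for your provider is exactly what you are testing, and the sample record is a trimmed, hypothetical stand-in rather than a real log entry.

```python
CANDIDATE_FIELDS = [
    "userIdentity.type",
    "userIdentity.principalId",
    "userIdentity.userName",
    "userIdentity.identityProvider",
    "requestParameters.roleArn",
    "requestParameters.roleSessionName",
    "sourceIPAddress",
    "additionalEventData",
]


def dig(record: dict, dotted: str):
    """Walk a dotted path through nested dicts; return None if any hop is missing."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def populated_fields(record: dict) -> dict:
    """Map each candidate path to True if it holds a non-empty value in this record."""
    return {path: dig(record, path) not in (None, "", {}, []) for path in CANDIDATE_FIELDS}


# Hypothetical, trimmed record for illustration only.
sample_event = {
    "eventName": "AssumeRoleWithWebIdentity",
    "sourceIPAddress": "192.0.2.10",
    "userIdentity": {"type": "WebIdentityUser"},
    "requestParameters": {
        "roleArn": "arn:aws:iam::123456789012:role/deploy",
        "roleSessionName": "GitHubActions-111",
    },
    "additionalEventData": {},
}

report = populated_fields(sample_event)
for path, present in report.items():
    print(f"{path:40s} {'populated' if present else 'EMPTY'}")
```

Run it over a few dozen known-good events rather than one, since a field that is present on most records and missing on some is the case that breaks a detection quietly.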

So the detections that actually work key off the things you can see: source IP, session naming, role and timing, and the downstream API calls. Three angles, in rough order of signal quality.

The strongest is source IP. GitHub-hosted runners egress from GitHub’s published ranges; you pull those from the actions arrays in https://api.github.com/meta and treat an AssumeRoleWithWebIdentity from outside them as suspicious. In Splunk, roughly:

index=cloudtrail eventName=AssumeRoleWithWebIdentity
| search NOT [| inputlookup github_actions_cidrs.csv | fields cidr | rename cidr as sourceIPAddress]
| rename "requestParameters.roleArn" as roleArn, "requestParameters.roleSessionName" as roleSessionName
| stats count by roleArn, sourceIPAddress, roleSessionName

But read the caveats before you ship it. GitHub’s meta ranges are large, they change without notice, and the Actions egress overlaps heavily with Azure ranges because the hosted runners live in Azure — so a naive CIDR match has a wide gray zone. You need a scheduled job refreshing that lookup (daily is fine; weekly will burn you the week they re-IP). And if you run self-hosted runners on EC2, those assume the role from inside your VPC or from your NAT egress IP, which is nowhere near the GitHub ranges — so the entire IP-based detection inverts for you, and you instead alert on assumptions that come from outside your known runner subnets. Pick the model that matches your fleet; running both detections at once just doubles the noise.
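
The refresh job for that lookup can be small. The sketch below pulls the meta document, collects every top-level key that starts with "actions", and writes a one-column CSV with a cidr header to match the lookup used in the search above. The function names, the User-Agent string, and the key-prefix rule are assumptions made for this example; the sample data uses documentation-reserved ranges, not GitHub’s real ones, so nothing here touches the network unless you call fetch_meta yourself.

```python
import csv
import ipaddress
import json
import urllib.request

META_URL = "https://api.github.com/meta"


def fetch_meta(url: str = META_URL, timeout: int = 10) -> dict:
    """Download the meta document. Not called at import time."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "application/vnd.github+json", "User-Agent": "cidr-refresh-sketch"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def actions_networks(meta: dict) -> list:
    """Parse every CIDR under top-level keys that start with 'actions'."""
    networks = []
    for key, value in meta.items():
        if key.startswith("actions") and isinstance(value, list):
            networks.extend(ipaddress.ip_network(cidr, strict=False) for cidr in value)
    return networks


def is_github_hosted(ip: str, networks: list) -> bool:
    """True if the address falls inside any of the parsed networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


def write_lookup_csv(networks: list, path: str) -> None:
    """Write the one-column lookup file with a 'cidr' header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["cidr"])
        writer.writerows([str(network)] for network in networks)


# Placeholder data using documentation-reserved ranges, NOT GitHub's real ranges.
SAMPLE_META = {
    "actions": ["192.0.2.0/24", "2001:db8::/32"],
    "actions_macos": ["203.0.113.0/24"],
    "web": ["198.51.100.0/24"],
}
sample_networks = actions_networks(SAMPLE_META)
print(is_github_hosted("192.0.2.10", sample_networks))
```

In a scheduled job you would call write_lookup_csv(actions_networks(fetch_meta()), "github_actions_cidrs.csv") and alert if the fetch fails, since a silently stale lookup is the failure mode the caveats describe.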

The second angle is roleSessionName. The action that assumes the role sets it, and most teams use a stable convention like GitHubActions-${{ github.run_id }}. A session name that doesn’t match your convention, or repeats a name across wildly different source IPs, is worth a look. It’s a weak signal on its own — trivially spoofable by anyone who controls the assume call — but it’s cheap and it correlates well with the IP signal.
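
A rough version of both session-name checks, assuming the GitHubActions-<run_id> convention above and events already flattened to a session name and a source IP (the record shape and the finding labels are invented for this sketch):

```python
import re
from collections import defaultdict

# Assumed convention: GitHubActions-<numeric run id>. Adjust to your own.
SESSION_RE = re.compile(r"GitHubActions-\d+")


def session_name_findings(events: list) -> list:
    """Flag off-convention names and names that appear from more than one source IP."""
    ips_by_name = defaultdict(set)
    findings = []
    for event in events:
        name = event["roleSessionName"]
        ip = event["sourceIPAddress"]
        if not SESSION_RE.fullmatch(name):
            findings.append(("off_convention", name, ip))
        ips_by_name[name].add(ip)
    for name, ips in sorted(ips_by_name.items()):
        if len(ips) > 1:
            findings.append(("name_reused_across_ips", name, sorted(ips)))
    return findings


# Hypothetical flattened events.
sample_events = [
    {"roleSessionName": "GitHubActions-111", "sourceIPAddress": "192.0.2.10"},
    {"roleSessionName": "GitHubActions-111", "sourceIPAddress": "198.51.100.7"},
    {"roleSessionName": "botocore-session-1", "sourceIPAddress": "203.0.113.9"},
]

for finding in session_name_findings(sample_events):
    print(finding)
```

Treat the reuse check as a lead rather than a verdict: jobs from a single run can land on different runners, so the same name from two IPs is not automatically hostile.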

The one that catches real abuse is behavioral: what the session did. Downstream CloudTrail events carry the assumed-role ARN, session name included, in userIdentity.arn, so use that to tie each call back to its session and baseline the API surface each CI role normally touches. A deploy role that has only ever called ecr:*, ecs:UpdateService, and a specific s3:PutObject prefix suddenly calling iam:CreateAccessKey, sts:GetCallerIdentity in a tight loop, or s3:GetObject across buckets it never reads is the shape of someone exploring a stolen session. (The S3 object calls are data events, so you only see them if data event logging is enabled for those buckets.) Expect the first pass to be noisy until you’ve baselined per-role, because CI roles drift constantly as pipelines get new steps. Threshold the alert on new API actions for a given roleArn over a trailing 30-day window rather than trying to enumerate an allowlist by hand; the allowlist will be stale inside a sprint.
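
The trailing-window comparison is simple enough to sketch. Assume you have already reduced downstream events to (timestamp, role ARN, action) tuples, with the action written as service:Operation; that reduction, the function name, and the 24-hour "recent" slice are choices made for this example, not a fixed recipe.

```python
from datetime import datetime, timedelta, timezone


def new_actions(events, now, window_days=30, recent_hours=24):
    """Return (role, action) pairs seen in the recent slice but not in the baseline window.

    Baseline: [now - window_days, now - recent_hours). Recent: [now - recent_hours, now].
    Events older than the baseline window are ignored entirely.
    """
    baseline_start = now - timedelta(days=window_days)
    recent_start = now - timedelta(hours=recent_hours)
    baseline, recent = set(), set()
    for timestamp, role_arn, action in events:
        if baseline_start <= timestamp < recent_start:
            baseline.add((role_arn, action))
        elif recent_start <= timestamp <= now:
            recent.add((role_arn, action))
    return sorted(recent - baseline)


# Hypothetical data for one deploy role.
NOW = datetime(2026, 6, 7, tzinfo=timezone.utc)
ROLE = "arn:aws:iam::123456789012:role/deploy"
sample_events = [
    (NOW - timedelta(days=40), ROLE, "iam:CreateAccessKey"),  # too old to count as baseline
    (NOW - timedelta(days=10), ROLE, "ecr:PutImage"),
    (NOW - timedelta(days=5), ROLE, "ecs:UpdateService"),
    (NOW - timedelta(hours=2), ROLE, "ecs:UpdateService"),
    (NOW - timedelta(hours=1), ROLE, "iam:CreateAccessKey"),
]

for role_arn, action in new_actions(sample_events, NOW):
    print("new action for", role_arn, "->", action)
```

Because the baseline is recomputed from the window on every run, a legitimately added pipeline step alerts once and then ages into the baseline on its own.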

First round of tuning

The false positives come from three predictable places. Reusable workflows and job_workflow_ref confuse the source-repo assumption — a centralized deploy workflow called from many repos produces assumptions that look cross-repo because they kind of are. Matrix builds fan out into dozens of near-simultaneous assumptions of the same role from the same run, which trips naive velocity rules; carve those out by run_id if you can correlate it. And third-party Actions that GitHub routes through different egress than you expect will land just outside your CIDR lookup and light up the IP detection for a day until you widen the range.
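
For the matrix carve-out, one option is to count distinct runs per role per time bucket instead of raw assumptions. The sketch below assumes the run ID is the suffix of the session name (the GitHubActions-<run_id> convention) and that timestamps are epoch seconds; both are assumptions about your setup, and the five-minute bucket is arbitrary.

```python
from collections import defaultdict


def distinct_runs_per_role(events, bucket_seconds=300):
    """Count distinct run IDs per (role, time bucket).

    events: iterable of (epoch_seconds, role_arn, role_session_name).
    A matrix build that assumes the role many times in one run counts once.
    """
    runs = defaultdict(set)
    for timestamp, role_arn, session_name in events:
        # Everything after the last '-' is treated as the run ID.
        run_id = session_name.rpartition("-")[2]
        bucket_start = timestamp - timestamp % bucket_seconds
        runs[(role_arn, bucket_start)].add(run_id)
    return {key: len(run_ids) for key, run_ids in runs.items()}


# Hypothetical burst: twelve matrix jobs from run 111, one job from run 222.
ROLE = "arn:aws:iam::123456789012:role/deploy"
sample_events = [(1000 + i, ROLE, "GitHubActions-111") for i in range(12)]
sample_events.append((1010, ROLE, "GitHubActions-222"))

counts = distinct_runs_per_role(sample_events)
print(counts)
```

Thirteen raw assumptions collapse to two runs in the bucket, which is the number a velocity rule should actually be thresholding on.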

Volume reality: in an org running a few hundred active repos with hosted runners, AssumeRoleWithWebIdentity is a high-frequency event — easily thousands a day. Alerting on the raw event is useless. The IP-anomaly version should be near-zero in steady state if your lookup is current, which is exactly why a sudden cluster of out-of-range assumptions is meaningful. Keep CloudTrail management events hot for at least 90 days if your retention budget allows, because the investigation pivot — “what did this session touch” — needs the downstream events in the same searchable window, and pulling them back from S3/Glacier mid-incident is the part that turns a two-hour triage into a two-day one.

Fix the trust policy, not just the detection

Detection is the backstop. The actual remediation is in the trust policy condition, and it’s boring on purpose: pin aud to sts.amazonaws.com, pin sub to the narrowest claim that works — ideally an environment:prod protected environment rather than a branch, because branch refs can be created by anyone with push access while protected environments gate on reviewers. Replace StringLike wildcards with StringEquals wherever the build context is fixed. Validate the org explicitly if you’re on the shared issuer. And give each pipeline its own role scoped to its own job; the shared github-actions-deploy role that forty repos assume is a blast-radius problem dressed up as convenience.

The control mapping is straightforward and worth putting in the SSP. In NIST SP 800-53 terms, the wildcard trust condition is an AC-6 least-privilege failure and an AC-4 flow-control problem at the federation boundary. Federated CI identities are IA-8 (non-organizational users) and the OIDC validation logic is IA-5/IA-9. The detections live in AU-2/AU-6 with SI-4 doing the monitoring. The trust policy itself is configuration that should be baselined and drift-checked under CM-2/CM-6; a Config rule or OPA/Conftest gate in your IaC pipeline that fails any AssumeRoleWithWebIdentity trust policy with a sub wildcard is the single highest-leverage thing here. And the whole tj-actions class of problem is SR-3/SR-4 supply chain: pin Actions to a full commit SHA, not a floating tag, because a tag is mutable and a compromised maintainer can repoint v1 at anything.
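
As a starting point for that gate, here is a minimal linter sketch over a trust policy document already parsed into a dict. It is not a Config rule or an OPA policy, just the same logic in plain Python: the function name and finding strings are made up, it only looks at the GitHub sub and aud keys, and it does not handle wildcard actions such as sts:* or other providers.

```python
SUB_KEY = "token.actions.githubusercontent.com:sub"
AUD_KEY = "token.actions.githubusercontent.com:aud"


def as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def lint_trust_policy(policy: dict) -> list:
    """Return findings for web-identity statements with missing or wildcarded conditions."""
    findings = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRoleWithWebIdentity" not in as_list(statement.get("Action", [])):
            continue
        subs, auds, wildcards = [], [], []
        for operator, block in statement.get("Condition", {}).items():
            for key, value in block.items():
                values = as_list(value)
                if key.lower() == SUB_KEY:
                    subs.extend(values)
                    # '*' and '?' only act as wildcards under a ...Like operator.
                    if "like" in operator.lower():
                        wildcards.extend(v for v in values if "*" in v or "?" in v)
                elif key.lower() == AUD_KEY:
                    auds.extend(values)
        if not subs:
            findings.append("no sub condition")
        if not auds:
            findings.append("no aud condition")
        findings.extend(f"wildcard in sub: {value}" for value in wildcards)
    return findings


def make_policy(condition: dict) -> dict:
    """Hypothetical helper that wraps a condition block in a trust policy."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": condition,
        }],
    }


PINNED = make_policy({"StringEquals": {AUD_KEY: "sts.amazonaws.com", SUB_KEY: "repo:my-org/my-repo:environment:prod"}})
WIDE = make_policy({"StringEquals": {AUD_KEY: "sts.amazonaws.com"}, "StringLike": {SUB_KEY: "repo:my-org/*"}})

print(lint_trust_policy(PINNED))
print(lint_trust_policy(WIDE))
```

Wired into CI, a non-empty findings list fails the plan; teams that genuinely need a wildcard can carry an explicit, reviewed exception rather than a silent one.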

The uncomfortable truth is that the keyless pattern is correct and you should keep using it. It removed the static secret, which was the bigger risk. It just relocated the trust decision to a place your existing secret-scanning and rotation tooling can’t see — and a StringLike wildcard in an IAM trust policy doesn’t trip a single alarm until a session that shouldn’t exist starts calling iam:CreateAccessKey from an Azure IP at three in the morning.
