AI Security & Agentic AI Threat Model

Most AI security writing in 2026 is still split between two audiences: ML researchers writing about adversarial perturbations on image classifiers, and security generalists writing about prompt injection in chatbots. Neither view is wrong, and neither is sufficient for the systems actually being deployed today — agents that plan, invoke tools, hold state across sessions, and make decisions on behalf of users.

This page is the working threat model for that newer class of system, and the index for the AI-security material on this site.

A note on terminology

AI security and AI safety are not the same discipline. Safety is the question of whether a model behaves as its operator intends — refusing harmful requests, avoiding bias, telling the truth. Security is the question of whether an adversary can subvert the operator’s intent: extract data, escalate privilege, exfiltrate secrets, manipulate other users, or pivot into adjacent systems.

Both matter. Most existing public benchmarks address safety. Most production incidents have been security. This page is about the second category.

The three layers

It is useful to think of an AI system as having three concentric layers, each with its own threat surface.

1. Model layer

The model itself is the innermost layer. Threats here are intrinsic to the weights, training data, or inference pipeline:

  • Training-data poisoning. Adversary contributes content to the training corpus that biases the model’s outputs in a targeted way.
  • Backdoor / trigger insertion. Specific input phrases produce attacker-chosen behavior at inference time.
  • Model extraction. Repeated queries reveal enough of the weights or behavior to reconstruct an equivalent model.
  • Membership inference. Adversary determines whether a specific record was in the training set — a privacy concern when training data is sensitive.
  • Prompt injection (direct). Adversarial input that overrides the system prompt or developer instructions.

Most organizations do not train their own foundation models, which means they inherit the model layer’s risk from the model provider. That makes vendor diligence, not internal red-teaming, the dominant control here. But model-layer assumptions can still fail at fine-tuning time, at retrieval time (RAG can act as a runtime poisoning channel), and at any boundary where untrusted text enters the model’s context.

2. System layer

The system layer is everything around the model: the retrieval pipeline, the prompt-construction logic, the conversation store, the surrounding application, and the data the model touches.

  • Indirect prompt injection. A document the model retrieves contains instructions targeting the model. The user did not write the instruction; the document did. This is the dominant production AI-security risk in 2026.
  • Insecure output handling. Application treats model output as trusted, executing it (in a SQL query, a shell command, a rendered HTML page) without escaping or validation.
  • Sensitive-data exposure. RAG pipelines that retrieve from a document store with mixed access levels can leak content the user is not authorized to see.
  • Resource exhaustion. Adversary crafts inputs that maximize token consumption, latency, or cost. Often classified as DoS but with a financial flavor.
  • Conversation poisoning. Adversary contaminates a shared or cached conversation that other users or agents subsequently consume.

System-layer threats are the most amenable to traditional security engineering: input validation, output encoding, authorization at retrieval time, rate-limiting. They also map cleanly to existing application-security disciplines.

3. Agentic layer

When the model is allowed to take actions — invoke tools, write to filesystems, call APIs, send messages — a third layer appears. Most of the genuinely novel risk in 2026 lives here.

  • Confused deputy across tools. The agent uses the authority granted for one tool to attack another. Authorization granted to each tool individually does not equal authorization for any combination.
  • Tool-description injection. Tool metadata (descriptions, parameter docs, return values) lands in the model’s context as effectively unsigned text. A malicious or compromised tool server can inject instructions there. See MCP Security: The New Attack Surface for a deep dive.
  • Goal hijacking. An untrusted document, tool output, or sub-agent persuades the agent to abandon its original objective.
  • Long-horizon credential creep. Agents accumulate scopes and tokens over a session. Without explicit revocation, an agent that started narrow ends up broad.
  • Sub-agent collusion. In multi-agent systems, an agent acting as planner, critic, or executor can coordinate (intentionally or otherwise) in ways no individual agent’s policy anticipates.
  • Persistent memory contamination. Agents that write to long-term memory create a new injection channel: poison the memory once, influence every future session.

The agentic layer is where existing application-security playbooks run out of vocabulary. Concepts like “confused deputy” and “ambient authority” come from operating-system security in the 1970s and apply here almost unmodified.

Reference frameworks worth using

Three frameworks are worth keeping at hand:

  • OWASP Top 10 for Large Language Model Applications. Practical, prioritized, application-developer-focused. Good for shipping checklists.
  • MITRE ATLAS. The ATT&CK-style adversary-behavior knowledge base for ML/AI systems. Good for red-teaming and threat-intelligence reporting.
  • NIST AI Risk Management Framework (NIST AI 100-1) and the Generative AI Profile (NIST AI 600-1). Good for governance, organizational controls, and conversations with leadership and auditors.

None of the three is sufficient alone. OWASP gives you the technical checklist; ATLAS gives you the adversary’s playbook; NIST gives you the language to talk about it across the organization.

Mapping to NIST 800-53

For organizations that have to answer to an Authorizing Official, here is a working mapping from the threat layers above to the relevant 800-53 control families. This is intentionally broad — most AI risks touch multiple families.

Threat area Primary 800-53 families
Training-data poisoning, RAG poisoning SI-7, SR, CM-7
Direct and indirect prompt injection SI-3, SI-10, SC-7
Insecure output handling SI-10, SC-8, SA-11
RAG sensitive-data exposure AC-3, AC-4, AC-6
Resource exhaustion / cost DoS SC-5, AU-12
Agentic confused deputy AC-3, AC-4, AC-6
Tool-description injection SI-3, SI-7, SC-7
Long-horizon credential creep AC-2, IA-5, AU-12
Persistent memory contamination SI-7, SI-10, AU-12

If the system you are evaluating is going through RMF, this mapping is the conversation-starter for the SSP’s narrative on AI-specific controls. Several of these will require tailoring or supplemental controls; the existing baselines were not written with agentic systems in mind.

Where to start

If you are doing this for the first time, in roughly this order:

  1. Inventory every model your organization actually uses, including via SaaS tools. The shadow-AI inventory is almost always larger than the sanctioned one.
  2. Identify which of those models can take actions, write to memory, or read from authoritative data. Those are your agentic systems.
  3. For each agentic system, threat-model the three layers. The agent layer should not be the first one you skip just because it is newest.
  4. Apply OWASP LLM Top 10 as a developer checklist, ATLAS as a red-team scoping tool, and NIST AI RMF as the governance wrapper.
  5. Update your 800-53 control narratives to reflect what you actually found.

The remaining AI-security material on this site — the Artificial Intelligence hub, the Large Language Models page, the MCP security post, and the steady stream of incident commentary — fits inside this scaffolding. Use this page as the index, and follow whichever thread is closest to the system you are responsible for.

Threat modeling is not a one-time activity. The agentic layer in particular is moving fast enough that any threat list more than six months old is already partially obsolete. Plan for revisions.