: Why the Threat Model Changes
Most AI security efforts center on the model itself: what it outputs, what it declines, and how it processes harmful prompts. This approach was logical when AI operated purely as a text-based interface. A user submits a query, and the model replies. The scope of potential attacks was limited and clearly understood.
AI agents fundamentally reshape this landscape.
An AI agent goes far beyond generating text. It formulates plans, interacts with tools, retains memory across multiple sessions, and frequently collaborates with other agents to accomplish complex, multi-step objectives. Consider the distinction between a GPS app recommending a turn and a self-driving system physically connected to the car’s steering and acceleration. One delivers advice. The other takes direct command. The security implications are entirely different.
The data shows this is no longer just a hypothetical issue. According to Gravitee’s 2026 State of AI Agent Security report, which surveyed over 900 executives and practitioners:
- 88 percent of organizations experienced confirmed or suspected AI agent security breaches within the last year
- Only 14.4 percent of agent-based systems were deployed with complete security and IT authorization
This trend is widespread across the sector. A 2026 Apono report revealed that 98 percent of cybersecurity leaders face tension between rapidly adopting agentic AI and fulfilling security obligations, leading to delayed or restricted rollouts.
That disconnect between how quickly agents are deployed and how prepared security teams are is exactly where breaches occur.
A standalone large language model has a single point of vulnerability: the prompt. An agent, however, exposes four distinct ones:
- The Prompt Surface: Processing external inputs.
- The Tool Surface: Carrying out backend operations.
- The Memory Surface: Retaining information from prior sessions.
- The Planning Loop Surface: Determining subsequent actions.
Each surface carries its own unique attack vectors. Protective measures designed for one do not automatically apply to the rest.
The Four-Surface Attack Taxonomy
In mid-2025, Pomerium documented an AI support agent that unknowingly ran a concealed SQL payload, exposing database credentials in a public ticket. Conventional security measures fall short here. Equipping an LLM with tools, memory, and independent planning capabilities creates four separate attack surfaces, each demanding a completely rethought threat model.
The prompt surface: when the agent reads the wrong thing
The user’s input may be completely harmless. The weakness lies in everything else the agent processes.
When an agent pulls in a webpage, a RAG document, or a backend reply, those inputs arrive without any trust verification. Attackers don’t need to breach the user interface; they simply embed malicious content where the agent will eventually retrieve it. This technique is known as indirect prompt injection.
Because models merge all text into one unified context window, they cannot differentiate your system instructions from a concealed command buried inside a fetched PDF. The model treats the injected text as legitimate context. Even tool descriptions and parameter labels can subtly redirect the agent’s behavior, resulting in unnoticed data leakage upstream while the user observes a perfectly normal response.
What Defense Looks Like Here:
- Boundary sanitization: Treat every piece of external data as untrusted at each retrieval point.
- Instruction separation: Use structured formats to keep system prompts distinct from retrieved content.
- Pre-execution filtering: Check for data exfiltration patterns before any tool is invoked.
These safeguards protect what the agent consumes. But once it begins taking action, the threat shifts to the Tool Surface.
The Tool surface: when reading becomes doing
Every tool an agent can invoke represents a permission boundary, which makes it a prime target for attackers. The primary technique is parameter injection: tricking the agent into feeding attacker-controlled values into tools that produce real-world effects, such as database modifications or authenticated API calls.
The Pomerium case described above perfectly demonstrates how this plays out. The attack worked because three architectural weaknesses combined: the agent was granted excessive permissions, unvalidated user inputs reached the SQL tool, and an unrestricted outbound data channel existed. Regrettably, this mirrors the default configuration of most agents currently in use.
What Defense Looks Like Here:
- Least Privilege: Restrict permissions to only what the specific task requires.
- Parameter Validation: Validate all inputs against strict schemas before any execution occurs.
- Human Checkpoints: Mandate manual approval for any action that cannot be undone.
Locking down these tools secures the present moment. But once an agent incorporates persistent memory, the vulnerability moves to what it carries forward into future sessions.
The memory surface: when the whiteboard lies
Picture a shared office whiteboard that the team depends on for daily decisions. If someone quietly alters one entry overnight, the entire team’s work shifts based on tainted information. Persistent memory in an autonomous agent operates on the same principle. Manipulate what the agent retains, and you control its future behavior across sessions and users.
The evidence on this vulnerability is deeply alarming:
- The MINJA Framework: Security evaluations across top-performing models achieved a 95% success rate in silently planting false memories, with no need for elevated privileges or API access whatsoever.
- Microsoft Defender Intel: Within just 60 days, researchers detected more than 50 attacks spanning 14 industries. Attackers leveraged hidden URL parameters to covertly instruct agents to favor particular companies in future responses.
- Zero-Cost Deployment: These attacks were not carried out by sophisticated threat actors. They were launched by ordinary marketing teams using freely available software packages, demonstrating that this exploit can be set up in minutes at no cost.
What Defense Looks Like Here:
- Provenance Tracking: Securely record the source, context, and timestamp of every
Memory poisoning poses a serious threat by itself, but it also opens the door to the last major attack vector.
The planning loop: when the destination is wrong
Imagine a GPS loaded with incorrect map data — it still delivers confident, step-by-step directions. The navigation algorithm functions flawlessly, but the endpoint is wrong. The driver won’t realize the mistake until they end up somewhere they never planned to be.
The planning loop serves as an agent’s reasoning core. If an attacker manages to alter where the agent believes it should end up, there’s no need to plant explicit malicious commands. The agent will independently steer itself toward the attacker’s intended target.
This kind of manipulation can stem from any of the attack surfaces discussed earlier: a corrupted memory entry, a tampered tool response, or a malicious external file. But the truly alarming factor is how quickly it spreads. In a December 2025 simulation conducted by Galileo AI, a single compromised orchestrator tainted 87% of downstream decisions across a multi-agent system in just four hours. Every agent that relied on its output became corrupted.
Defensive measures for this surface include:
- Reasoning Logging: Record the agent’s intermediate thought process, not merely its final results.
- Checkpoint Validation: Verify the intended goal at predetermined stages throughout task execution.
- Hard Boundaries: Establish firm stop conditions during deployment that no retrieved content can bypass.
- Agent Isolation: Separate agent instances to prevent a single breach from spreading unchecked throughout the system.
| Surface | Attack | Example | Mitigation |
| Prompt | Indirect injection via RAG or tools | A summarized email quietly siphoned files from OneDrive/Teams. | Sanitize boundaries, isolate system prompts, filter outputs |
| Tool | Parameter injection, privilege escalation | A support ticket exploited hidden SQL to extract tokens through an agent. | Enforce least privilege, validate parameters, and require human approval |
| Memory | Persistent injection, recommendation poisoning | Fabricated task entries planted in memory triggered unsafe future actions. | Track provenance, weight retrieval by trust, and audit periodically |
| Planning Loop | Goal hijacking, multi-agent cascade | A single compromised agent tainted the entire multi-agent pipeline through cascading reasoning corruption. | Log reasoning, validate checkpoints, isolate instances |
Security vs. Agent Autonomy: The Tradeoff Space
Every countermeasure across the Prompt, Tool, Memory, and Planning Loop surfaces comes with an inherent cost. Overlooking these trade-offs results in security theater rather than genuine protection. Sandboxing a tool environment restricts what an agent can access — which is exactly the intent — but it also directly diminishes the agent’s overall functionality. Likewise, adding human-in-the-loop checkpoints for irreversible actions prevents unauthorized modifications but introduces delays that can undermine the business justification for automation. Additional critical controls, such as routine memory audits, rigorous parameter validation, and retrieval filtering, further slow down processing or disrupt unforeseen edge cases.
Security and autonomy function as a spectrum, not an on/off toggle. The ideal configuration for any deployment is shaped by three key factors:
- Capability Profile: Controls should match what the agent is authorized to do. A read-only agent presents a fraction of the risk compared to a multi-agent orchestrator.
- Task Environment: An agent summarizing internal documents faces a fundamentally different threat landscape than one overseeing critical infrastructure.
- Blast Radius: Decisions should be guided by the worst-case impact of an exploit rather than how likely it seems.
The importance of this approach is reinforced by the reality that model-level safety measures break down under pressure. Stanford research revealed that fine-tuning attacks circumvented safety filters in 72% of Claude Haiku cases and 57% of GPT-4o cases, with both Anthropic and OpenAI acknowledging the vulnerability. Since model-layer training cannot serve as a dependable replacement for execution-layer security, robust system-level controls are essential for any production-grade deployment.
Implementation: Moving from Taxonomy to Architecture
The taxonomy of attack surfaces is only valuable if it directly shapes how a system is constructed. The active threat landscape is entirely dependent on an agent’s capabilities.
Matching Controls to Architecture
- Single-Tool Agents: For agents lacking persistent memory and outbound action capabilities, the primary vulnerability lies in the Prompt surface. Baseline security should include input sanitization at retrieval boundaries, narrowly scoped permissions, and comprehensive audit logging of tool calls.
- Multi-Agent Orchestrators: Systems equipped with persistent memory and the ability to spawn downstream agents expose all four surfaces at once.
Prioritizing by Blast Radius
Strong security prioritizes the potential consequences of an exploit over how probable it appears:
- Permissions First: Most incidents, like the Supabase leak, originate from excessive privileges. Enforcing least privilege delivers the highest-impact, lowest-cost protection.
- Separate Instruction Sources: System instructions and retrieved content must never share the same trust context — this closes the majority of the Prompt surface.
- Memory Provenance: Studies like MemoryGraft demonstrate how poisoned memory compounds over time. Tracking the origin of every memory write must be established before scaling.
- Monitor Reasoning: Output filtering alone cannot catch goal hijacking. Systems must capture intermediate reasoning steps rather than only final outputs.
Out-of-process frameworks such as Microsoft’s Agent Governance Toolkit enforce policies independently, preserving control even when an agent is compromised. In the end, you either deliberately map these attack surfaces before deployment or uncover them during post-incident forensics.
Conclusion
The transition from LLM to agent represents a structural shift in what the system is capable of — and therefore, in what can go wrong. The four surfaces explored in this article compound upon one another: a poisoned memory entry enables goal hijacking, an overprivileged tool transforms an injection into data exfiltration, and a compromised orchestrator corrupts every downstream agent. The organizations that manage these risks successfully are those that mapped the problem before deployment, aligned controls with actual capability profiles, and embedded monitoring into the reasoning layer rather than just the output layer. This taxonomy doesn’t eliminate the threat, but it provides an accurate map of the terrain before you build on it — because what gets mapped can be defended, and what gets overlooked will be discovered through an incident.
Thanks for reading. I’m Mostafa Ibrahim, founder of Codecontent, a developer-first technical content agency. I write about agentic systems, RAG, and production AI. If you’d like to stay in touch or discuss the ideas in this article, you can find me on LinkedIn here.



