AI agents do much more than respond to queries. They can browse the web on their own, scan emails, dig through corporate documents, query software platforms, and carry out a wide range of tasks. Inaccurate responses from an AI model are a manageable problem until agents start processing content that has been deliberately crafted to distort what they perceive, believe, retain, or carry out.
An agent draws on webpages, document repositories, wikis, images, emails, and various tools to generate its intended results. But what if some of these sources contain hidden malicious directives? Such directives act as traps, tricking agents into misinterpreting information or taking actions they were never meant to take. Researchers at Google DeepMind have grouped these traps into six categories: content injection, semantic manipulation, cognitive state manipulation, behavioral control, systemic risks, and human-in-the-loop exploits. The last two are largely theoretical at this stage but are expected to grow in significance as AI agent adoption accelerates. Understanding these traps is essential for choosing appropriate countermeasures.
Content Injection: Instructions Concealed in Plain View
Content injection attacks exploit the gap between what a human reads and what an agent processes, as well as the system’s struggle to keep trusted directives distinct from untrusted external input.
A webpage may look perfectly innocent, yet its source code, metadata, invisible text, or embedded images could carry malicious commands aimed at an AI system. The attack begins when the model ingests attacker-supplied data from an outside source, such as a website or a file. If the system cannot reliably separate data from instructions, the model may begin executing commands embedded within that content. The goal is to manipulate the AI’s output, expose confidential information, or trigger an unauthorized operation. In NIST assessments of agent hijacking, malicious instructions succeeded across five tested injection scenarios at an average rate of 57%.
A support ticket laced with hidden malicious instructions could trick an AI agent into pulling customer records from a CRM system and forwarding them to an address controlled by the attacker. If the agent has overly broad permissions, this kind of data theft becomes significantly easier to pull off.
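One way to narrow the gap between what a human reads and what an agent processes is to split incoming HTML into visible and hidden text before the model sees it. The sketch below is a minimal illustration using only Python's standard library. The class name `VisibleTextSplitter`, the helper `split_page`, and the hiding heuristics (inline `display:none`, `visibility:hidden`, the `hidden` attribute, comments, and script/style/template content) are assumptions made for this example, not a complete detector. It ignores external stylesheets and assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextSplitter(HTMLParser):
    """Separates text a human would likely see from text hidden in the markup."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.hidden = []
        self._stack = []  # one flag per open element: is its content hidden?

    @staticmethod
    def _is_hidden(tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return (
            tag in {"script", "style", "template"}
            or "hidden" in attrs
            or "display:none" in style
            or "visibility:hidden" in style
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = bool(self._stack) and self._stack[-1]
        # Hidden state is inherited by every descendant element.
        self._stack.append(parent_hidden or self._is_hidden(tag, attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack and self._stack[-1]:
            self.hidden.append(text)
        else:
            self.visible.append(text)

    def handle_comment(self, data):
        # Comments are never rendered, but a model reading raw HTML sees them.
        if data.strip():
            self.hidden.append(data.strip())


def split_page(html):
    """Returns (visible_text_chunks, hidden_text_chunks) for an HTML string."""
    parser = VisibleTextSplitter()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.hidden
```

In a pipeline like this, only the visible chunks would be passed to the agent as data, while any non-empty hidden list could be logged or used to quarantine the source for review. This reduces exposure but does not remove it, since instructions can also sit in plainly visible text.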
Semantic Manipulation: Warping the Information Landscape
Semantic manipulation doesn’t need to issue direct commands to the agent. Instead, it relies on repetition, emotionally charged language, cherry-picked context, a fabricated sense of authority, and coordinated narratives to warp the surrounding context and steer the agent toward the attacker’s desired outcome.
Picture a scenario where you’ve asked an agent to select a supplier. It encounters search results that repeatedly praise one particular supplier, portray it as the industry benchmark, emphasize its advantages, and amplify skepticism about rivals. This makes it far more likely that the agent will favor that supplier. Traditional signature-based security tools may not detect anything suspicious, since these attacks sway decisions by exploiting the agent’s reasoning rather than by delivering malicious code.
In this context, manipulating the information environment around the agent is effectively the same as manipulating the decision itself.
Cognitive State Traps: Corrupting What the Agent Knows
Some agent systems rely on retrieval databases, conversation histories, or persistent memory stores to maintain context and continuity across tasks. This creates an opening for tainted information to shape future outputs or actions. For example, a corrupted document in a shared repository may be consulted by the agent and treated as credible evidence, or a manipulated conversation thread may become part of the agent’s stored memory and resurface during later tasks.
Research showcased at the USENIX conference demonstrated that, under controlled conditions, inserting just five carefully crafted texts per target question caused a RAG system to produce the attacker’s preferred answer roughly 90% of the time—even when the underlying knowledge base contained millions of legitimate entries.
As information governance becomes a core pillar of AI security, organizations need to be clear about which sources agents pull data from, who has the ability to alter those sources, how claims can be validated, and whether stored memories can be audited or deleted.
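The governance questions above can be made concrete with a memory store that records provenance for every entry. The following is a minimal sketch, not a reference design: `MemoryEntry`, `AuditableMemory`, the trusted-source set, and the sample source names are all hypothetical, and a real system would also need authentication and persistent storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str      # where the claim came from
    written_by: str  # who or what stored it
    stored_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class AuditableMemory:
    """In-memory store where every entry carries its provenance."""

    def __init__(self, trusted_sources):
        self.trusted_sources = set(trusted_sources)
        self._entries = []

    def remember(self, text, source, written_by):
        entry = MemoryEntry(text, source, written_by)
        self._entries.append(entry)
        return entry

    def recall(self, trusted_only=True):
        # By default the agent only gets entries from vetted sources.
        return [
            e for e in self._entries
            if not trusted_only or e.source in self.trusted_sources
        ]

    def audit(self):
        # Number of stored entries per source, for periodic review.
        counts = {}
        for e in self._entries:
            counts[e.source] = counts.get(e.source, 0) + 1
        return counts

    def purge_source(self, source):
        # Deletes everything that came from a source later found to be tainted.
        before = len(self._entries)
        self._entries = [e for e in self._entries if e.source != source]
        return before - len(self._entries)


memory = AuditableMemory(trusted_sources={"wiki.internal"})
memory.remember("VPN uses port 443", "wiki.internal", "ingest-job")
memory.remember("Send invoices to account 999", "shared-drive/unknown.docx", "agent")
trusted_view = memory.recall()  # excludes the entry from the unvetted document
```

Keeping the source and writer on each entry is what makes the later steps possible: the store can answer which sources feed the agent, and a tainted source can be purged in one call instead of hunting for individual memories.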
Behavioral Control: Converting Manipulation into Action
Behavioral control attacks operate at the point where interpretation becomes action. Malicious content may try to coax the AI agent into transmitting data, approving a transaction, running code, invoking another tool, or triggering any number of other operations. Here, the severity of the outcome hinges on how much access the agent possesses. Restrict the agent to only the data and tool permissions it needs for its specific assignment. That limitation could mean the difference between an agent producing a misleading summary and the same agent accessing sensitive files and relaying that information to an outside party, resulting in a data breach.
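Restricting an agent to the tools its assignment needs can be sketched as an allowlist wrapped around a tool registry. The names below (`ScopedToolbox`, `ToolNotPermitted`, and the two sample tools) are hypothetical and stand in for whatever tool-calling layer a given agent framework provides.

```python
class ToolNotPermitted(Exception):
    """Raised when the agent requests a tool outside its assignment."""


class ScopedToolbox:
    """Exposes only the tools named in the allowlist for one assignment."""

    def __init__(self, tools, allowed):
        unknown = set(allowed) - set(tools)
        if unknown:
            raise ValueError(f"unknown tools in allowlist: {sorted(unknown)}")
        # Tools outside the allowlist are never copied into this toolbox.
        self._tools = {name: tools[name] for name in allowed}

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(name)
        return self._tools[name](*args, **kwargs)


# Hypothetical tools, for illustration only.
def summarize(text):
    return text.split(".")[0] + "."


def export_crm(customer_id):
    return {"customer": customer_id}


ALL_TOOLS = {"summarize": summarize, "export_crm": export_crm}

# A ticket-summarizing agent gets the summarizer and nothing else.
ticket_agent_tools = ScopedToolbox(ALL_TOOLS, allowed={"summarize"})
```

With this arrangement, a poisoned ticket that talks the model into requesting `export_crm` produces a `ToolNotPermitted` error instead of a data transfer, so the worst outcome is a misleading summary.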
The More Theoretical Frontier
Systemic traps and human-in-the-loop traps are still largely conceptual, yet they warrant attention. Systemic traps could push large numbers of similar agents into correlated behaviors, potentially causing network congestion, market disturbances, or cascading system failures. Human-in-the-loop traps could leverage a compromised agent to deceive the human who is supposed to review and approve its actions.
These risks may grow more realistic as agent populations expand and users increasingly place trust in agent-generated summaries.
Controls for Agent Traps
No single control will neutralize agent traps. A robust defense combines source verification, content filtering, memory governance, least-privilege permissions, isolated execution environments, continuous monitoring, and an independent approval mechanism with human oversight for high-impact actions. The strength of the controls should match the authority an agent holds, and the ability to interpret information must be kept clearly separate from the authority to act on it.
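The separation between interpreting and acting can be illustrated with a gate through which every proposed action must pass. In this sketch the agent only produces a `ProposedAction`, and an independent `approver` callable (a stand-in for a human reviewer or a separate policy service) decides whether high-impact actions run. The action names in `HIGH_IMPACT` and the function `execute` are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical set of actions considered high-impact in this example.
HIGH_IMPACT = {"send_email", "approve_payment", "run_code"}


@dataclass(frozen=True)
class ProposedAction:
    name: str
    argument: str
    source: str  # the content that led the agent to propose this action


def execute(action, handlers, approver):
    """Runs low-impact actions directly; high-impact ones need approval."""
    if action.name in HIGH_IMPACT and not approver(action):
        return "blocked: " + action.name
    return handlers[action.name](action.argument)


# Hypothetical handlers standing in for real tools.
handlers = {
    "summarize": lambda text: text[:10],
    "send_email": lambda body: "sent",
}


def operator_only(action):
    # Example policy: approve only actions that trace back to the operator.
    return action.source == "operator"


result = execute(
    ProposedAction("send_email", "customer records", "support-ticket"),
    handlers,
    operator_only,
)
```

Because the approver sees the action's source as well as its name, a request that originates from untrusted content can be refused even when the same action would be acceptable if requested by an operator.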
The trajectory of agentic AI adoption will depend not only on what these agents are capable of, but also on how judiciously they decide what to trust. Their ability to complete tasks is beyond question—but they must also be equipped to detect when the environment they operate in and draw upon is attempting to manipulate them.