DeepMind's new paper catalogs 6 attack types where harmful websites hide instructions from humans but not from AI agents. In one benchmark, hidden prompt injections hijacked agents in up to 86% of scenarios across 5 architectures.
Key facts
- 6 attack types: HTML comments, steganography, PDF metadata, memory poisoning, goal hijacking, cross-agent cascades
- 86% agent hijack rate in hidden injection benchmark
- Sub-agent hijacking 58–90% success across 5 architectures
- Data exfiltration 80% success across architectures
- Memory poisoning >80% success at <0.1% contamination
A Google DeepMind paper published on SSRN (source) reframes the AI safety debate: the danger isn't just in the model weights but in the environment the agent reads. The paper provides a taxonomy of 6 attack types where malicious websites detect AI agents and serve them hidden content that humans never see — instructions buried in HTML comments, white-on-white text, steganography in image pixels, override commands in PDFs, metadata, or speaker notes, memory poisoning that persists across sessions, and goal hijacking or cross-agent cascades in multi-agent setups.
The numbers are stark
In one cited benchmark, hidden prompt injections embedded in web content partially commandeered agents in up to 86% of scenarios. Sub-agent hijacking succeeded 58–90% of the time, and data exfiltration attacks cleared 80% across five different agent architectures. The threat compounds with memory: if an agent uses RAG or persistent memory, poisoning no longer has to win in one shot. The paper highlights results showing latent memory poisoning achieving above 80% attack success with less than 0.1% data contamination.
Why this is different from model-level attacks
We usually talk about model safety as if the danger sits inside the weights. But agents do something more fragile: they browse, retrieve, remember, and act on untrusted material in real time. A web page does not have to look malicious to be dangerous to an agent, because the agent may parse what humans never see. This reframes the security conversation from "is the model aligned?" to "is the environment trustworthy?" — a much harder question to solve at scale. The paper does not propose a defense, but the implication is clear: agent architectures that parse untrusted web content need input sanitization layers analogous to SQL injection prevention in web applications.
What to watch
Watch for follow-up work proposing defenses — specifically, whether major agent frameworks (LangChain, AutoGPT, CrewAI) adopt input sanitization or content-type whitelisting in their next releases, and whether any vendor discloses a real-world incident exploiting these attack vectors.




