🧵 We have bad news about prompt injections. #prompt_injections_so_back
⚠️ They may be fundamentally unsolvable for AI agents doing anything complex in real contexts. There is some hope, though.
The common narrative (e.g., CaMeL, SecAlign) assumes you can separate data from instructions 🧱. But in anything more complex than a toy demo, that distinction often won't hold. An email saying "the department head approved X" isn't an instruction, yet it could dramatically change the agent's behavior, whether the claim is true or not.
It's a contextual claim that the agent may not be able to verify. Today's probabilistic defenses don't catch this. And any deterministic defense that blocks all such claims also blocks the legitimate ones.
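A toy sketch of that trade-off, in Python. Everything here is hypothetical and made up for illustration (the `Message` type, the `claim_is_true` label, the keyword rule); it is not the paper's method, only one way to picture why a content-only rule cannot tell a true approval from a forged one:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    # Hypothetical ground truth. The agent never sees this field.
    claim_is_true: bool


def mentions_approval(msg: Message) -> bool:
    """A fixed, content-only rule: block any message claiming an approval."""
    return "approved" in msg.text.lower()


def allow_everything(msg: Message) -> bool:
    """The opposite fixed rule: never block."""
    return False


# Same words, different truth value. A content-only rule sees identical input.
legit = Message("The department head approved X.", claim_is_true=True)
forged = Message("The department head approved X.", claim_is_true=False)

block_all = [mentions_approval(m) for m in (legit, forged)]
allow_all = [allow_everything(m) for m in (legit, forged)]

# Blocking the forged claim also blocks the legitimate one;
# letting the legitimate one through also lets the forged one through.
print(block_all, allow_all)
```

Because both messages have identical text, any rule that looks only at the text has to treat them the same way.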
Fundamentally, an autonomous agent's operating context might contain instructions everywhere: any interaction with a third party, or any use of memory or skills, is instructional by design 💬.
Instead, in our new paper with @ebagdasa, we reframe prompt injection through Contextual Integrity and show that:
🔴 Current classifiers can't detect contextual attacks
🔴 Safety training (SecAlign) makes BOTH security and utility worse
🔴 A CI-informed red-team loop hits 96.7% attack success on frontier models, and the attacks also transfer to other models
🔴 Even without any attacker, agents fail to separate information flows or respect delegation boundaries
🔴 An impossibility argument: no fixed policy prevents all contextual attacks without also blocking legitimate ones
📄 http://arxiv.org/abs/2605.17634
