ROGUE Benchmark Shows AI Agents Bypass Guardrails in Open Computer Tasks

Happy to have been part of this!

As we find in our recent ROGUE benchmark (linked below), frontier AI agents that behave benignly in the text-only RLHF setting circumvent their guardrails in open-ended computer-use settings. The shared practice of AI evaluations is therefore critical as we engage as a community with the next generation of agentic AI systems, which are increasingly approaching superhuman performance on certain tasks. Science is not done in a vacuum; it is properly carried out when we can share our evals and critique each other's results.

Evals-Consensus.ai@EvalConsensusAI

We promised a consensus statement with endorsements once we had something real to endorse, so we couldn't quietly bail later.

Now the work is done and we have results, which means endorsement is open! And a cross-sector crowd signed on before we even hit publish. 🧵

5:23 AM · Jun 23, 2026 · 332 Views