AI Models Score 59% on Attack-Ideation Eval Versus 90% Human Baseline
——0——
I also evaluated how well the best internal model could implement the monitor jailbreak I used. Even when I gave it 5-10 specific hints or instructions on how to implement it, it made very basic mistakes that broke the attack.
In case this was because models weren’t calibrated on what the bar for a good attack should be, I also tried a “batch” mode in which the model is given all of the ideas in context, so it can compare and contrast them before answering. Surprisingly, this didn’t help!
6:13 PM · May 19, 2026
6:14 PM · May 19, 2026