
AI Models Score 59% on Attack-Ideation Eval Versus 90% Human Baseline


In case this was because models weren’t calibrated on what the bar should be for a good attack, I also tried a “batch” mode where the model is given all of the ideas in context, so it can compare/contrast them before answering. Surprisingly, this didn’t help!

11:13 AM · May 19, 2026

I also evaluated how well the best internal model could implement the monitor jailbreak I used. Even when giving it 5-10 specific hints/instructions on how to implement it, it made very basic mistakes that break the attack.

david rein (@idavidrein)


6:13 PM · May 19, 2026
6:14 PM · May 19, 2026