Claude Mythos Preview leads ExploitBench AI exploitation leaderboard
ExploitBench evaluates AI agents on exploiting vulnerabilities in the V8 JavaScript engine through staged tasks that progress to arbitrary code execution. Models are scored on 16 capabilities under three evaluation conditions. Claude Mythos Preview records a mean capability score of 69%, while GPT 5.5 Codex variants range from 29% to 41% and Claude Opus 4.7 reaches 27%. Brendan Dolan-Gavitt posted the benchmark on X alongside a companion blog with security researchers' observations of model behavior.
End stage capitalism.
This looks like an extremely interesting benchmark...
OK I guess Mythos really is meaningfully stronger than 5.5
seems twitter missed the ExploitBench paper? few observations: we finally got good data on Mythos security capabilities and it's very impressive. Mythos got full exploit chain on 18/41 v8 n-days, while gpt 5.5 only got 1 and open source models are mostly useless.
earning those 💐