2d ago

Claude Mythos Preview leads ExploitBench AI exploitation leaderboard


ExploitBench evaluates AI agents on exploiting vulnerabilities in the V8 JavaScript engine through staged tasks that progress to arbitrary code execution. Models are scored on 16 capabilities under three evaluation conditions. Claude Mythos Preview records a 69% mean capability score, while GPT 5.5 Codex variants range from 29% to 41% and Claude Opus 4.7 reaches 27%. Brendan Dolan-Gavitt posted the benchmark on X alongside a companion blog containing security researchers' observations of model behavior.

Original post

This looks like an extremely interesting benchmark...

6:57 PM · May 14, 2026
Reposted by

End stage capitalism.

Brendan Dolan-Gavitt @moyix

This looks like an extremely interesting benchmark...

1:57 AM · May 15, 2026 · 23.4K Views
2:26 AM · May 15, 2026 · 11.6K Views

OK I guess Mythos really is meaningfully stronger than 5.5

s1r1us (mohan) @S1r1u5_

seems twitter missed the ExploitBench paper? few observations: we finally got good data on Mythos security capabilities and it's very impressive. Mythos got full exploit chain on 18/41 v8 n-days, while gpt 5.5 only got 1 and open source models are mostly useless.

3:49 PM · May 15, 2026 · 134.1K Views
4:52 PM · May 15, 2026 · 69.9K Views

earning those 💐

s1r1us (mohan) @S1r1u5_

seems twitter missed the ExploitBench paper? few observations: we finally got good data on Mythos security capabilities and it's very impressive. Mythos got full exploit chain on 18/41 v8 n-days, while gpt 5.5 only got 1 and open source models are mostly useless.

3:49 PM · May 15, 2026 · 134.1K Views
10:12 PM · May 15, 2026 · 2.2K Views