/Tech · 4h ago

Epoch AI evaluation finds Mythos excels at software exploitation, but Jaime Sevilla attributes its vulnerability discovery gains to higher inference budgets

Anthropic's Ethan Perez flagged the model's automated exploitation capabilities

#151

Original post

Epoch AI@EpochAIResearch

How big a leap is Mythos in cyber capabilities?

@timotheechauvin, @AlexBarry4, @js_denain, and @ansonwhho compiled the public evidence and found that while it’s unclear if Mythos was ahead of trend in discovering vulnerabilities, it represents a big jump in exploiting them. 🧵

2:16 PM · Jun 11, 2026 · 11.1K Views

Cluster Engagement

Posts from X

Most Activity

VIEWS 1.5K · BOOKMARKS 1

Epoch AI@EpochAIResearch

The bottom line is that Mythos Preview seems clearly ahead of trend in exploiting vulnerabilities, but we lack similarly strong evidence for finding vulnerabilities. Either way, its cyber capabilities aren’t just “hype”, and we expect the same to be true for Mythos 5.

4h1.5K131

LIKES 15

Epoch AI@EpochAIResearch

Public evidence clearly suggests that Mythos Preview represents a big leap in vulnerability exploitation.

Aggregated scores on cyber benchmarks (which mainly measure exploit development) show that Mythos Preview is ~7 months ahead of trend, compared to ~3 months for GPT-5.5.

4h45415
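
To make "months ahead of trend" concrete, here is a minimal sketch of one way such a figure could be computed: fit a straight line to earlier models' aggregate scores over time, then ask when that line would have reached the new model's score. This is not Epoch AI's methodology, which the thread does not spell out. The function name `months_ahead_of_trend`, the linear-trend assumption, and every number in the example are hypothetical.

```python
def months_ahead_of_trend(trend_points, model_month, model_score):
    """Estimate how many months early a model reached its score.

    trend_points: list of (month, score) pairs for earlier models, where
    month is time in months since an arbitrary reference date.
    Assumes scores grow roughly linearly over time (an assumption, not a
    claim about how real benchmark aggregates behave).
    """
    xs = [m for m, _ in trend_points]
    ys = [s for _, s in trend_points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least-squares fit: score = intercept + slope * month
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Month at which the fitted trend would reach the model's score
    expected_month = (model_score - intercept) / slope

    # Positive result: the model arrived earlier than the trend predicts
    return expected_month - model_month


# Made-up data: three earlier models, then a new model released at month 15
earlier_models = [(0, 0.0), (6, 6.0), (12, 12.0)]
print(months_ahead_of_trend(earlier_models, model_month=15, model_score=22.0))
```

A positive result means the model arrived earlier than the fitted trend predicts, and a negative one means it lags. The answer depends heavily on which earlier models define the trend and on whether the benchmarks involved are saturated.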

REPLIES 1

pozitiv4ik@Gc_qube

@EpochAIResearch @scaling01 @timotheechauvin @AlexBarry4 @js_denain @ansonwhho Why were Mythos April and Mythos Early released on the same day?

4h147

Epoch AI@EpochAIResearch

We mainly looked at public evidence from Mythos Preview, where we have the most data. According to Anthropic, Mythos 5 is only a “modest” upgrade over Mythos Preview for cybersecurity, and so our conclusions should largely apply there too.

4h89915

Epoch AI@EpochAIResearch

Importantly, some benchmarks were run on an early checkpoint of Mythos Preview, which was notably weaker. This, alongside many of the initially available benchmarks reaching saturation, is probably why some earlier analyses found Mythos Preview to be similar to GPT-5.5.

4h36914

Epoch AI@EpochAIResearch

When Anthropic first announced Mythos Preview, they called it a big leap in two cyber capabilities:

1. Vulnerability discovery: finding weaknesses in a codebase
2. Vulnerability exploitation: given a weakness, designing code that takes advantage of it

4h74213

Epoch AI@EpochAIResearch

For vulnerability discovery, there aren’t any unsaturated benchmarks we can look at. One source of evidence we can consider is public databases of tracked vulnerabilities for companies involved in Project Glasswing. These show a big spike around when Mythos Preview was announced.

4h35513

Epoch AI@EpochAIResearch

What does this mean in practice? Across multiple benchmarks, Mythos Preview is often able to use public information about real-world vulnerabilities to construct exploits to fully compromise affected systems. These capabilities could enable widespread and damaging cyberattacks.

4h35412

Epoch AI@EpochAIResearch

But it’s not clear that this spike reflects improved vulnerability discovery capabilities — it’s possible that prior models were also good at this, and the spike mostly reflects a surge in investment. After all, Project Glasswing involves up to $100M in API credits.

4h32411

Epoch AI@EpochAIResearch

Indeed, there’s evidence that AIs were good at finding vulnerabilities prior to Mythos.

E.g., the maintainers of the curl software library used AI code scanners prior to Project Glasswing, and Mythos Preview only managed to find one additional low-severity vulnerability.

4h31310

Epoch AI@EpochAIResearch

That said, several companies in Project Glasswing highlighted Mythos Preview’s low false positive rate when finding vulnerabilities. This could be a big deal in practice because it cuts down on a lot of the human labor needed to check if a vulnerability is actually real.

4h30110

Epoch AI@EpochAIResearch

This week’s Gradient Update was written by @timotheechauvin, @AlexBarry4, @js_denain, and @ansonwhho.

All Gradient Updates are informal, opinionated analyses that represent the views of individual authors, not Epoch AI as a whole.

Full essay: https://epochai.substack.com/p/are-mythos-cyber-capabilities-overhyped

4h1.1K12

Alexander Barry@AlexBarry4

@Gc_qube @EpochAIResearch @scaling01 @timotheechauvin @js_denain @ansonwhho It wasn't very clear what date to give to the 'early' release, since it was only given to a handful of places to evaluate (and we only know they got access in March but not the exact date). So for simplicity we just aligned it with April 7th.

3h172

Neuralease@neuralease

@EpochAIResearch @timotheechauvin @AlexBarry4 @js_denain @ansonwhho Surely this isn't the largest model we can manage in the modern day. OpenAI trained its own 10T back when Grok 3 was all the rage.

4h138

PHOTON COURIER@Ahmourinabil20

The Ideal Structure k+1+k: The Guaranteed Majorana Zero-Mode, the Parity-Split Variance Floor, and the Conditional Linear Response of the Strong-Coupling Real Elliptic Ginibre Ensemble http://doi.org/10.5281/zenodo… We study the real elliptic Ginibre ensemble S = H + gA in the strong-coupling limit, where H is a symmetric Gaussian matrix and A a real antisymmetric Gaussian matrix. We prove that the variance of the real parts of the eigenvalues converges to a deterministic, parity-split floor: (n−2)/[2(n−1)] for even dimension and exactly 1/2 for odd dimension. This finite-n even/odd distinction — commonly assumed absent at leading order in the asymptotic literature — is shown to be forced by the Altland–Zirnbauer class-D structure of the operator B = iA, whose spectrum takes the exact form of an ideal structure n = 2k+1: k particle modes (+λ), k antiparticle modes (−λ) bound to them by exact particle-hole conjugation, and one neutral Majorana mode pinned at zero.

The guaranteed neutral mode is the pivot of the entire structure. We establish that it exists by oddness, that it is its own particle-hole conjugate (a Majorana mode), that it is statistically independent of the side modes (kernel–image orthogonality), and that its real part has variance exactly one — the precise origin of the odd-dimensional excess. We then determine the conditional linear response E_A[Σ Re²] = α·tr(H²) + β·(tr H)² in closed form for both parities, finding that α carries the zero-mode fingerprint (α_even = (n−1)/[n(n+2)], α_odd = 1/(n+2)) while β = (n+1)/[n(n+2)] is parity-independent. Finally we localize the finite-coupling correction: the protected neutral mode carries only a 1/g² correction, whereas the 1/g term is carried by the near-real, persistently-defective modes.

Every analytical result is derived from first principles (degenerate perturbation theory, Gaussian–Wick contraction, kernel–image orthogonality) and independently verified by high-statistics simulation at a fixed seed. The work positions these results against the current literature on real-eigenvalue statistics of the elliptic Ginibre ensemble (to mid-2026), and states explicitly the single quantity whose closed form remains open, together with the precise reason for the obstruction. The methodological discipline is strict throughout: no fitting presented as derivation, no manual adjustment, every limit of the method named honestly.

@mathemetica @Math_files @amermathsoc @amermathsoc @cambUP_maths @PhysInHistory @zone_astronomy

3h5

Darshan Yadav@DarshanSays

The distinction matters - unclear on *discovering* vulnerabilities vs. documented jump in *exploiting* them. Those are different threat models.

Exploitation assistance scales directly with attacker sophistication. Worth reading the full thread before deciding what this means for your defensive tooling and red team assumptions.

4h1