Claude Fable 5 writes first-ever verified megakernel on KernelBench-Mega for an 18.7x GPU speedup


it took Claude Fable 2.5 hours to write a fused megakernel which delivers a >18x speed-up over a PyTorch baseline

now please recall that:
- Fable is not the full Mythos model
- Anthropic can spend much more than just 2.5h and ~550k tokens on this
- they probably have better harnesses

Anthropic is definitely doing some sweet autoresearch internally. The architecture research bros especially are probably so happy at Anthropic. Imagine vibe-testing a new arch, or tweaking an existing one, and wanting to test it in a semi-optimized way. Just let 10T Mythos cook for a day.

Elliot Arledge@elliotarledge

Claude Fable 5 [max] wrote the first genuine (and fastest) megakernel ever submitted to KernelBench-Mega.

It was tested on: Kimi-Linear W4A16 batch-1 decode for RTX PRO 6000 Blackwell. Every prior model "won" it with a multi-kernel Triton pipeline that fails our single-fused-kernel authenticity gate:

> Opus 4.8 at 14.4x
> GLM-5.2 11.1x
> GPT-5.5 4.3x
> Sonnet 5 4.0x

Fable shipped 18.7x over reference, and torch.profiler shows exactly ONE cooperative kernel launch per decoded token. Int4 dequant (nibbles unpacked in-register, never materialized), conv+SiLU, KDA gated-delta state, MLA absorbed-latent attention with online softmax, MoE router + top-8 experts, RMSNorms, and even the KV cache append are all inside one launch, staged by 14 grid barriers. We overwrote its input buffers mid-audit to prove it recomputes on live data. It does.
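For readers unfamiliar with the "online softmax" mentioned above, here is a minimal pure-Python sketch of the general technique. It is not taken from the submitted kernel: the function names are hypothetical, and the values are scalars for brevity, whereas attention would accumulate vectors. The idea is a single pass that tracks a running max and rescales the partial sums whenever that max grows, so the full score row never has to be stored.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One pass over (score, value) pairs; assumes a non-empty, finite input."""
    running_max = -math.inf
    normalizer = 0.0   # running sum of exp(score - running_max)
    acc = 0.0          # running sum of exp(score - running_max) * value
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale everything accumulated so far to the new max.
        scale = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        normalizer = normalizer * scale + weight
        acc = acc * scale + weight * v
        running_max = new_max
    return acc / normalizer

def two_pass_softmax_weighted_sum(scores, values):
    """Reference version: find the max first, then normalize."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [1.0, 3.0, -2.0, 7.5]
values = [0.5, -1.0, 2.0, 4.0]
online = online_softmax_weighted_sum(scores, values)
reference = two_pass_softmax_weighted_sum(scores, values)
```

Both functions should agree up to floating-point rounding. The single-pass form is what lets a fused kernel stream over the KV cache without a separate max pass.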

The advantage grows with context: 17.8x at 2k, 18.9x at 8k, 19.5x at 16k. Longer context means a bigger KV cache and more attention work per token, which is usually where a decode kernel bleeds. Keeping everything in one launch amortizes the fixed barrier overhead and the int4 GEMV stays bandwidth-bound, so the gap over the reference widens instead of closing.
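To make "bandwidth-bound" concrete, here is a rough Python sketch of the usual lower bound for a GEMV: bytes moved divided by memory bandwidth. Everything numeric here is an assumption for illustration (the 4096x4096 shape, the group size of 128, fp16 scales and activations, and the 1000 GB/s bandwidth); none of it is a measured figure from the run.

```python
def int4_gemv_bytes(rows, cols, group_size=128, scale_bytes=2, act_bytes=2):
    """Approximate memory traffic for one int4-weight GEMV (assumed layout)."""
    weight_bytes = rows * cols // 2             # two 4-bit weights per byte
    scale_count = rows * (cols // group_size)   # one scale per weight group
    io_bytes = (cols + rows) * act_bytes        # input vector + output vector
    return weight_bytes + scale_count * scale_bytes + io_bytes

def bandwidth_bound_time_us(bytes_moved, bandwidth_gb_per_s):
    """Lower bound on runtime in microseconds if memory traffic is the limit."""
    return bytes_moved / (bandwidth_gb_per_s * 1e9) * 1e6

# Hypothetical layer shape and bandwidth, chosen only for illustration.
total_bytes = int4_gemv_bytes(rows=4096, cols=4096)
floor_us = bandwidth_bound_time_us(total_bytes, bandwidth_gb_per_s=1000)
fp16_weight_bytes = 4096 * 4096 * 2             # same layer stored as fp16
```

Under these assumptions the int4 weights are a quarter the size of the fp16 ones, which is why keeping the dequant cheap enough to stay on the bandwidth limit matters so much for batch-1 decode.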

It spent 64% of the session in silence: timing the baseline, microbenchmarking grid barriers, and deriving a ~29x bytes/token roofline. Then it wrote the whole kernel once, hit 14.4x on the first benchmark, and spent the last hour deleting barriers and making int4 dequant free (one LOP3 + HSUB2/HMUL2). The one regression it tried (finer split-K) it measured and reverted instead of rationalizing.
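The "one LOP3 + HSUB2/HMUL2" dequant most likely refers to a well-known fp16 magic-number trick, though the submitted kernel may differ in detail. Below is a Python emulation of one common form, not the kernel's actual code. OR-ing a nibble into the bit pattern of 1024.0 (0x6400) gives an fp16 equal to 1024 + nibble, so a subtract and a multiply finish the job with no integer-to-float conversion. The helper names, the zero point of 8, and the low-nibble-first order are all assumptions.

```python
import struct

MAGIC = 0x6400  # IEEE fp16 bit pattern of 1024.0; its ulp is exactly 1

def half_from_bits(bits):
    """Reinterpret a 16-bit integer as an fp16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def dequant_nibble(nibble, scale, zero_point=8):
    # Step 1 (one LOP3 on the GPU): mask the nibble and OR it into the
    # mantissa of 1024.0, producing the fp16 value 1024 + nibble.
    as_half = half_from_bits(MAGIC | (nibble & 0xF))
    # Step 2 (HSUB2): remove the 1024 bias together with the zero point.
    centered = as_half - (1024.0 + zero_point)
    # Step 3 (HMUL2): apply the per-group scale.
    return centered * scale

def unpack_and_dequant(packed_byte, scale, zero_point=8):
    """Split one byte into two 4-bit weights (low nibble first, assumed)."""
    lo = packed_byte & 0xF
    hi = (packed_byte >> 4) & 0xF
    return (dequant_nibble(lo, scale, zero_point),
            dequant_nibble(hi, scale, zero_point))

pair = unpack_and_dequant(0xA3, scale=0.5)  # nibbles 3 and 10
```

On the GPU the same steps run on packed half2 registers, two weights at a time, which is how the unpacked values can stay in registers without being written out.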

http://kernelbench.com/mega


Elliot Arledge@elliotarledge

DUDE!!! https://kernelbench.com/runs/20260701_172615_claude_claude-fable-5_02_kimi_linear_decode_solution.py.txt


Elliot Arledge@elliotarledge

wasnt able to finish h100 and b200 sweeps due to rate limits (used the entirety of two 20x max coding plans (400 USD))

im open to sponsors to help keep this benchmark alive for such pricey models. hit me with a dm and we can figure something out so everyone can benefit :)


Rafa Schwinger 🇻🇦@Rafa_Schwinger

@elliotarledge Had a similar experience but on metal


Rafa Schwinger 🇻🇦@Rafa_Schwinger

@elliotarledge Metal retardkernels

Basically, with a single run of Fable I barely have to do anything, and it is better than like one week of supervised Opus

For flux klein and qwen 3.6


Elliot Arledge@elliotarledge

@Rafa_Schwinger are you writing metal megakernels?

