Maturity.
Claude Fable 5 [max] wrote the first genuine (and fastest) megakernel ever submitted to KernelBench-Mega.
It was tested on Kimi-Linear W4A16 batch-1 decode for RTX PRO 6000 Blackwell. Every prior model "won" it with a multi-kernel Triton pipeline that fails our single-fused-kernel authenticity gate: Opus 4.8 at 14.4x > GLM-5.2 at 11.1x > GPT-5.5 at 4.3x > Sonnet 5 at 4.0x.
Fable shipped 18.7x over reference, and torch.profiler shows exactly ONE cooperative kernel launch per decoded token. Everything runs inside that one launch, staged by 14 grid barriers: int4 dequant (nibbles unpacked in-register, never materialized), conv+SiLU, KDA gated-delta state, MLA absorbed-latent attention with online softmax, MoE router + top-8 experts, RMSNorms, even the KV cache append. We overwrote its input buffers mid-audit to prove it recomputes on live data. It does.
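For readers who have not met the online softmax named above: it is what lets attention run in a single pass over the KV cache, with no separate max or sum pass. Below is a minimal pure-Python sketch of the standard recurrence. It is not Fable's kernel code, and the function name and list-based layout are ours, for illustration only.

```python
import math

def online_softmax_attend(scores, values):
    """Single pass over (score, value) pairs; returns softmax(scores) @ values."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale everything accumulated under the old max to the new max.
        correction = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Only the running max, the denominator, and the accumulator persist between steps, which is why the same recurrence can live in registers inside a fused kernel.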
The advantage grows with context: 17.8x at 2k, 18.9x at 8k, 19.5x at 16k. Longer context means a bigger KV cache and more attention work per token, which is usually where a decode kernel bleeds. Keeping everything in one launch amortizes the fixed barrier overhead, and the int4 GEMV stays bandwidth-bound, so the gap over the reference widens instead of closing.
It spent 64% of the session in silence: timing the baseline, microbenchmarking grid barriers, deriving a ~29x bytes/token roofline. Then it wrote the whole kernel once, hit 14.4x on the first benchmark, and spent the last hour deleting barriers and making int4 dequant free (one LOP3 + HSUB2/HMUL2). The one regression it tried (finer split-K) it measured and reverted instead of rationalizing.
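We are not reproducing Fable's kernel here, but the "one LOP3 + HSUB2/HMUL2" count matches a well-known magic-number trick, sketched below in NumPy. OR-ing a nibble into the fp16 bit pattern for 1024.0 (0x6400) yields 1024 + q with no integer-to-float conversion. One subtract removes the bias and zero point, and one multiply applies the scale. The zero point of 8, the low-nibble-first packing, and the per-call scalar scale are our assumptions, not details from the submission.

```python
import numpy as np

MAGIC = 0x6400  # fp16 bit pattern for 1024.0; at this magnitude one ulp is 1.0

def dequant_nibbles(packed, scale, zero_point=8):
    """Dequantize packed uint8 bytes (two int4 values each) to float16."""
    packed = np.asarray(packed, dtype=np.uint8)
    lo = (packed & 0x0F).astype(np.uint16)   # low nibble first (assumed layout)
    hi = (packed >> 4).astype(np.uint16)
    nibbles = np.stack([lo, hi], axis=-1).reshape(-1)
    # On the GPU the mask and the OR fuse into a single LOP3.
    biased = (nibbles | np.uint16(MAGIC)).view(np.float16)   # equals 1024 + q
    # HSUB2 analogue: remove the 1024 bias and the zero point together.
    centered = biased - np.float16(1024 + zero_point)
    # HMUL2 analogue: apply the quantization scale.
    return centered * np.float16(scale)
```

For example, `dequant_nibbles([0x80, 0xF7], scale=0.5)` unpacks the nibbles 0, 8, 7, 15 and centers them to -8, 0, -1, 7 before scaling.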
http://kernelbench.com/mega










