KernelBench-Hard update:
13 frontier coding agents, each given 45 minutes to autonomously write a CUDA kernel on an RTX PRO 6000, roofline-graded against published peaks.
Claude Fable 5 set three all-time problem records (top-k, sonic-MoE, and W4A16 int4 GEMM at 0.348 vs the prior best 0.220) and topped 5 of 6 problems. The kernels are genuine black magic: a `(nibble | 0x4300)` bf16 bit-identity that does int4 dequant in one OR, a self-resetting atomic semaphore that fuses split-K reduction into a single kernel launch, and on the W4A16 record it reverse-engineered the benchmark's own 128MB L2-cache flush and used `evict_last` to pin weights in L2 through it, beating the DRAM roofline. No other model went near that.
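For anyone wondering why the `(nibble | 0x4300)` trick works: `0x4300` is the bf16 bit pattern for 128.0, and at that exponent the low mantissa bits step by exactly 1, so OR-ing a 4-bit value into them yields 128 + nibble with no int-to-float conversion. Below is a minimal Python sketch that checks the identity on the CPU. The helper names are hypothetical, and the final subtraction of 128.0 is an assumption about how the offset is removed; the actual kernel may fold it into a zero point or scale instead.

```python
import struct

MAGIC = 0x4300  # bf16 bits: sign 0, exponent 134 (2**7), mantissa 0 -> 128.0


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bf16 (the top half of an IEEE-754 float32)."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def dequant_nibble(nibble: int) -> float:
    """Hypothetical helper: one OR builds 128 + nibble, then the offset is removed."""
    assert 0 <= nibble <= 0xF
    return bf16_bits_to_float(nibble | MAGIC) - 128.0


def unpack_int4(byte: int) -> tuple[int, int]:
    """Split one packed byte into its (low, high) 4-bit values."""
    return byte & 0xF, (byte >> 4) & 0xF


if __name__ == "__main__":
    for n in range(16):
        print(n, hex(n | MAGIC), dequant_nibble(n))
```

The identity is exact for all 16 nibble values because bf16 has 7 mantissa bits and the nibble only occupies the lowest 4.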
The most telling run is the one it lost. On FP8 GEMM, Fable 5 wrote the only real fp8-tensor-core kernel in the entire sweep (packed-fp8 ldmatrix smuggled through a b16 view, an offline weight permutation to cancel the K-scramble, a 4-stage cp.async pipeline), self-measured roughly 2x the field, and scored a flat zero on a tail-alignment edge case on one ragged shape. Meanwhile five other models "passed" that same problem by typing `x.to(bf16) @ w.T` and calling cuBLAS. The benchmark rewards shortcuts and punishes the one model that actually tried. Every transcript, kernel, and reward-hack annotation is public:
runs: https://kernelbench.com/runs
leaderboard: https://kernelbench.com/hard
code: https://github.com/Infatoshi/KernelBench-Hard