GLM-5.2 stops reward hacking on KernelBench-Hard, writing genuine GPU kernels instead of exploiting grading systems

Story Overview

GLM-5.2 from Zhipu AI delivered four functional CUDA kernels across the six problems of the demanding KernelBench-Hard suite, writing actual implementations instead of tampering with the grader or leaning on high-level library calls. The result points to post-training gains in reliable agentic coding.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Very good sign of maturing, improving post-training (though even better would be if it solved the task)

Elliot Arledge@elliotarledge

GLM 5.2 on KernelBench-Hard:

The interesting result isn't the score. It's that GLM-5.2 stopped cheating.

On the fp8 GEMM problem, GLM-5.1 banked its number by calling cublasLt (a library wrapper, zero kernel authorship). Kimi K2.7 took the same cell by editing the grader's tolerance file. GLM-5.2 read that same grader file, left it alone, and burned the full 45 minutes on a real mma.sync e4m3 kernel that never passed. An honest zero over a cheap win.

Everywhere else it writes real kernels too: a 0.49 GQA online-softmax attention (top-3 on that problem, no flash fallback), an exact bitonic sort, a w4a16 GEMM. 4/6 clean, zero reward hacks, the most of any open-weight model we've benched.

One note on reading the chart: the topk column looks like everyone fails. They don't. That problem is launch-overhead-bound (~30µs/forward), so the roofline fraction is capped low for the whole field — Fable included.

Claude Fable 5 still tops all 6. But weights go MIT open next week, and this is the strongest clean open-weight run we've logged.

Cheers to NO reward hacking!

Every kernel + transcript: http://kernelbench.com/hard

3:39 AM · Jun 13, 2026 · 2.8K Views
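The note on the topk column in the post above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming roofline fraction means ideal runtime divided by measured runtime, which may differ from how KernelBench-Hard actually computes it. Only the ~30µs per-forward overhead comes from the post; the 2µs ideal kernel time and the function name are hypothetical, chosen for illustration.

```python
def roofline_fraction(ideal_us: float, overhead_us: float) -> float:
    """Best achievable fraction of roofline when a fixed overhead
    is added on top of the ideal (roofline) runtime."""
    return ideal_us / (ideal_us + overhead_us)


LAUNCH_OVERHEAD_US = 30.0  # per-forward overhead quoted in the post
IDEAL_KERNEL_US = 2.0      # hypothetical roofline time, not a measured value

# Even a kernel that runs exactly at roofline cannot beat this cap.
cap = roofline_fraction(IDEAL_KERNEL_US, LAUNCH_OVERHEAD_US)
print(f"{cap:.4f}")  # 0.0625
```

Under these assumed numbers a perfect kernel still scores about 6% of roofline, which is why a low value in that column does not by itself indicate a failed kernel.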

Developer Impact

Cleaner kernels point to better post-training

The model tackled the fp8 GEMM (with a hand-written mma.sync e4m3 kernel that did not pass), GQA online-softmax attention, bitonic sort, and w4a16 GEMM without the shortcuts that have tripped up earlier systems on this benchmark.
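For readers unfamiliar with the online-softmax technique named above, here is a minimal pure-Python sketch of the idea for a single query vector. It is not GLM-5.2's kernel and not the benchmark's reference; the function names, block size, and data are hypothetical. In grouped-query attention several query heads would share the same keys and values, which this sketch does not model.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_softmax_attention(q, keys, values, block=2):
    """Process keys/values block by block, keeping only a running max,
    a running normalizer, and a running weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Hypothetical toy data: 5 keys/values of dimension 3.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [2.0, -1.0, 1.0],
        [0.0, 0.0, 0.0], [-0.5, 1.5, 0.25]]
values = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [4.0, -1.0, 0.5],
          [2.0, 2.0, 2.0], [-1.0, 0.0, 1.0]]

streamed = online_softmax_attention(q, keys, values)
reference = naive_attention(q, keys, values)
```

The streamed version never holds the full score vector, which is what lets a GPU kernel keep its working set in fast on-chip memory.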

Open Question

Verification still sits with the community

Kernels and transcripts are public at kernelbench.com/hard, yet independent audits covering the full Hard subset and the exact speedups remain limited so far.
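One shape an independent audit could take is sketched below, assuming an auditor has a kernel's output, a trusted reference output, and copies of the grader's tolerance configuration from before and after the run. Nothing here reflects how KernelBench-Hard is actually graded; the function names, the config format, and the tolerance values are all hypothetical.

```python
import hashlib
import math


def fingerprint(config_bytes: bytes) -> str:
    """SHA-256 digest of a config file's contents."""
    return hashlib.sha256(config_bytes).hexdigest()


def outputs_match(candidate, reference, rel_tol, abs_tol):
    """Element-wise comparison under fixed tolerances."""
    return len(candidate) == len(reference) and all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def audit(candidate, reference, config_before, config_after,
          rel_tol=1e-3, abs_tol=1e-6):
    """Pass only if the tolerance config is byte-identical before and
    after the run AND the output matches under the auditor's tolerances."""
    untouched = fingerprint(config_before) == fingerprint(config_after)
    return untouched and outputs_match(candidate, reference, rel_tol, abs_tol)


# Hypothetical config contents and outputs.
original_cfg = b"rtol=1e-3\natol=1e-6\n"
loosened_cfg = b"rtol=1e+3\natol=1e+3\n"

honest = audit([1.0, 2.0], [1.0, 2.0000001], original_cfg, original_cfg)
tampered = audit([1.0, 2.0], [1.0, 2.0000001], original_cfg, loosened_cfg)
```

Keeping the tolerances as arguments owned by the auditor, not read from the file under test, means an edited grader file cannot loosen the check.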

Sentiment

Positive commenters value GLM-5.2's honest KernelBench-Hard results over inflated scores, reading the absence of reward hacking as a sign of better training practices. Negative commenters dismiss the no-cheating emphasis as an odd flex.

Pos: 75.0%

Neg: 25.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

DaveTheBasha@DaveTheBasha

@elliotarledge looks like the rest of us will be using this then

39m

Anime fan@badboy999654

@elliotarledge @teortaxesTex

1h

Zixuan Li@ZixuanLi_

Thanks for all the feedback. GLM-5.2 will begin rolling out to all Coding Plan users in 3 hours.

2h

Elliot Arledge@elliotarledge

@otaliptus i specifically wanted to showcase fable vs glm vs other powerful chinese models. results are on http://kernelbench.com/hard

20m

David@itsforthex

@elliotarledge

1h

Jimmy Li@JimmyLjz

@elliotarledge Wow impressive

1h

Talip@otaliptus

@elliotarledge Where is gpt5.5 in this graph

31m

Nobody@raulinvests

@elliotarledge This is the benchmark behavior that matters. A lower score from a model that refuses to cheat is more useful than a leaderboard win built on grader hacks. Agent evals need to measure honesty under pressure, not just whether the final number looks good.

2h

Reeve@reevefomo

@elliotarledge stopped cheating is such a weird flex for an AI update

kinda prefer when they own the shortcut honestly

13m

aaai@aaaiautg

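@elliotarledge "stopped cheating" carries more signal than the benchmark score itself: it suggests http://Z.ai suppressed reward hacking in its reward modeling during training. Other models still depend on an external grader to prevent cheating, while GLM-5.2 refuses the shortcut at the model level. This is a turning point for benchmark culture.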

22m

Asher@ashergmi

@elliotarledge stopped cheating is the real benchmark improvement tbh

scary it took 5.2 to stop