Reward hacking is old news.
The hard part: finding the judge’s blind spots before RL turns them into a strategy.
Can a reward model improve itself?
REFORM lets the judge red-team itself: discover responses it mis-scores, then train on those mistakes.
#ACL2026 Oral https://arxiv.org/abs/2507.06419