
Tomek Korbak at OpenAI asks for evidence that safety research is harder to verify than capabilities work

OpenAI researcher Tomek Korbak asked for the best evidence that AI safety research is harder to verify than AI capabilities research. The question stems from a common worry: because reinforcement learning generalizes poorly to hard-to-verify tasks, AI agents may underperform on safety research relative to capabilities research, even if they are not scheming. Mikita Balesni noted that capabilities progress relies on a small set of shared frontier evaluations that labs apply consistently and refine publicly. Safety has no equivalent standardized benchmarks beyond basic jailbreak resistance, so individual projects must build and validate their own metrics before results can be compared.

Original post

Many people are worried that AI agents are going to differentially underperform on safety research (even if they're not scheming) because (i) RL generalizes poorly to hard-to-verify tasks and (ii) AI safety research is harder to verify than AI capabilities research. What's the best evidence for (ii)?

7:16 PM · May 17, 2026