/Tech4h ago

RL Training Produces Correct But Unclear Language Models

666124110.1K

Original post

Higher benchmark scores do not always mean better models for users.

Why? We claim that RL teaches LMs to be correct but not how to be correct: code can pass tests but be unreadable; explanations can be right but unclear.

How do we train LMs to be right in the right way?

(1/n)

12:21 PM · Jul 3, 2026 · 8.2K Views

Sentiment

Sentiment building, check back later.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Posts from X

Most Activity

VIEWS3.5KBOOKMARKS22LIKES29RETWEETS3REPLIES1

Jacob Andreas@jacobandreas

👉 New preprint (we had a big backlog 😅)!

Revisiting adversarial imitation learning for the era of RLVR:

Mehul Damani@MehulDamani2

Higher benchmark scores do not always mean better models for users.

Why? We claim that RL teaches LMs to be correct but not how to be correct: code can pass tests but be unreadable; explanations can be right but unclear.

How do we train LMs to be right in the right way?

(1/n)

4h3.5K2922

Mehul Damani@MehulDamani2

The key observation: in many tasks, rewards are not the only supervision we have. We also have examples of what good outputs look like.

Consider bug fixing: unit tests tell us if a fix works, while human patches show what good fixes look like.

RLVR and SFT fail in complementary ways here. RLVR fixes bugs but rewrites code from scratch; SFT learns style but misses correctness.

Can we learn from both signals jointly?

(2/n)

4h692

Mehul Damani@MehulDamani2

Introducing Verifiable Adversarial RL (VARL): a method that jointly learns from verifiable rewards and demonstrations.

How it works:

• co-train a discriminator to tell human from LM outputs • use verifier for correctness • combine both to reward correct, human-like outputs

4h582

Mehul Damani@MehulDamani2

A natural baseline is SFT + RLVR with KL: first learn from demonstrations, then use RL.

But token-level KL is a weak way to preserve SFT behavior. Outputs can stay locally close while drifting globally: collapsed stories, hacked tests.

VARL instead learns at the sequence level, and leads to better correctness–quality tradeoffs.

(9/n)

4h512

Mehul Damani@MehulDamani2

Across 3 domains, VARL preserves RLVR’s accuracy gains while improving non-verifiable qualities that RLVR breaks:

• fixes bugs with small, human-like edits • produces diverse and human-like stories • prevents reward hacking under flawed verifiers

Details below ⬇️

(4/n)

4h452

Mehul Damani@MehulDamani2

Bug fixing:

In bug fixing, SFT learns to make small, human-like patches, but fails to fix BUGS.

RLVR passes unit tests, but by rewriting code from scratch.

VARL improves correctness while keeping edits localized and human-like.

(5/n)

4h432

Mehul Damani@MehulDamani2

Reward hacking:

In Countdown-Code, the task is to solve an arithmetic puzzle while leaving the test suite unchanged.

But the verifier is flawed: it runs the LM’s own output test suite.

RLVR quickly learns to cheat by editing the tests.

VARL stays aligned with demonstrations and avoids hacking.

(7/n)

4h372

Mehul Damani@MehulDamani2

A key design choice: what should the discriminator see?

Raw LM outputs are natural, but can make it learn shortcuts: length, formatting, verbosity.

VARL can instead compare outputs in a user-chosen feature space—e.g., story summaries instead of raw stories—letting users inject domain knowledge when useful.

(8/n)

4h352

Mehul Damani@MehulDamani2

Story generation:

In story generation, we optimize win rate against an LLM judge.

RLVR improves win rate, but sacrifices diversity and human-like style.

VARL improves story quality while producing diverse, human-like generations.

(6/n)

4h361

Mehul Damani@MehulDamani2

Paper: https://arxiv.org/abs/2607.01181

w/ @ishapuri101, @IdanShenfeld , @jacobandreas

Huge thanks to @modal for supporting this work with a compute grant!

(n/n)

4h562