Asymmetric Clipping Boosts Policy Gradient RL Accuracy By 10.1%

Original post

Asymmetric clipping lets the model move more aggressively when reinforcing a correct-but-rare judgment, while staying conservative when walking back a bad one. +10.1% accuracy.

Ravid Shwartz Ziv@ziv_ravid

CISPO loss with asymmetric clipping, replacing standard importance sampling. In policy gradient RL, you reweight updates by the ratio of the new policy's probability to the old policy's probability for an action, then clip that ratio so a single update can't swing too far. Standard clipping is symmetric, with the same bound in both directions.

9:11 PM · Jun 30, 2026 · 13 Views
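
To make the clipping described above concrete, here is a minimal sketch in plain Python. It assumes a CISPO-style objective in which the clipped importance ratio acts as a constant weight on the advantage-weighted log-probability; the post does not spell out the loss, so that form is an assumption. The function names and the bounds `eps_low=0.2` and `eps_high=0.4` are illustrative and not taken from the post.

```python
import math

def clipped_ratio(logp_new, logp_old, eps_low=0.2, eps_high=0.4):
    """Importance ratio pi_new / pi_old, clipped to [1 - eps_low, 1 + eps_high].

    eps_high > eps_low gives the asymmetric case: more room to raise an
    action's probability than to lower it. Both defaults are illustrative.
    """
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)

def cispo_style_loss(logps_new, logps_old, advantages, eps_low=0.2, eps_high=0.4):
    """Mean of -weight * advantage * logp_new over a batch of actions.

    Assumption: the clipped ratio is treated as a constant weight. In an
    autodiff framework it would be wrapped in a stop-gradient so that only
    logp_new receives gradient.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        weight = clipped_ratio(lp_new, lp_old, eps_low, eps_high)
        terms.append(-weight * adv * lp_new)
    return sum(terms) / len(terms)
```

With these illustrative bounds, a ratio of 3.0 (the new policy makes the action three times as likely) is capped at 1.4, while a ratio of about 0.33 is floored at 0.8, so upward moves get twice the headroom of downward ones.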


Posts from X

Ravid Shwartz Ziv@ziv_ravid

End result: 84.7% accuracy, beating every frontier model tested, at roughly 1/14th the inference cost.

The trained model made 29.8% fewer mistakes than the best frontier model.

39m ago
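
The two figures in the post above can be related with a little arithmetic. The sketch below assumes that "29.8% fewer mistakes" means a relative reduction in error rate on the same evaluation set; the post does not say so explicitly, and the function name is hypothetical.

```python
def implied_baseline_accuracy(model_accuracy, relative_error_reduction):
    """Accuracy of the comparison model implied by a relative error reduction.

    Assumes error = 1 - accuracy and
    model_error = baseline_error * (1 - relative_error_reduction).
    """
    model_error = 1.0 - model_accuracy
    baseline_error = model_error / (1.0 - relative_error_reduction)
    return 1.0 - baseline_error

# Figures from the post: 84.7% accuracy, 29.8% fewer mistakes.
print(round(100 * implied_baseline_accuracy(0.847, 0.298), 1))  # prints 78.2
```

Under that reading, the best frontier model would have scored roughly 78.2%. This is an inference from the two stated numbers, not a figure reported in the thread.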

Ravid Shwartz Ziv@ziv_ravid

Interleaving keeps each batch's gradient clean while still cycling through every task often enough to keep the model from overfitting to one task before seeing the rest. +12.1% accuracy.

39m ago
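
One way to read the interleaving described above: every batch contains examples from a single task, and consecutive batches rotate through the tasks. The sketch below assumes that reading. The round-robin order, the wrap-around within each task, and the names `interleaved_batches` and `task_examples` are illustrative choices, not details from the post.

```python
from itertools import cycle

def interleaved_batches(task_examples, batch_size, num_batches):
    """Yield (task_name, batch) pairs.

    Each batch holds examples from one task only, and tasks rotate in
    round-robin order so no task goes unseen for long.
    task_examples maps a task name to a non-empty list of examples.
    """
    cursors = {name: 0 for name in task_examples}  # next read position per task
    order = cycle(task_examples)                   # round-robin over task names
    for _ in range(num_batches):
        name = next(order)
        examples = task_examples[name]
        start = cursors[name]
        # Wrap around when a task runs out of examples.
        batch = [examples[(start + i) % len(examples)] for i in range(batch_size)]
        cursors[name] = (start + batch_size) % len(examples)
        yield name, batch
```

For example, with `{"a": [1, 2, 3], "b": [10, 20]}`, a batch size of 2, and four batches, this yields an "a" batch, a "b" batch, then "a" and "b" again, never mixing tasks within a batch.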

Ravid Shwartz Ziv@ziv_ravid

On-policy distillation with a moving teacher. The student is regularized toward a teacher model's output distribution and penalized for drifting away from it. But the teacher isn't fixed. Every 20 steps, the current student checkpoint can be promoted to teacher, but only if it has just hit a new validation high. +3.1% accuracy.

39m ago
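
The promotion rule described above can be sketched as follows. This assumes validation is scored at each 20-step boundary and that the drift penalty is a KL divergence from student to teacher; the post names neither the penalty nor the evaluation schedule, so both are assumptions. All function and parameter names are hypothetical.

```python
import math

def kl_penalty(student_probs, teacher_probs):
    """KL(student || teacher) over one output distribution.

    Assumption: the drift penalty is this reverse-KL form. Teacher
    probabilities are assumed to be strictly positive.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def maybe_promote(step, val_score, best_val, student_ckpt, teacher_ckpt, every=20):
    """Return (teacher_ckpt, best_val) after a possible promotion.

    The student replaces the teacher only on a promotion step (every 20
    steps, per the post) and only if it set a new validation high.
    """
    if step % every == 0 and val_score > best_val:
        return student_ckpt, val_score
    return teacher_ckpt, best_val
```

Gating promotion on a new validation high means the teacher only ever moves to a checkpoint that scored better than any before it, so a temporarily degraded student cannot drag the regularization target down with it.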

Ravid Shwartz Ziv@ziv_ravid

The bigger claim here is that general-purpose frontier models may plateau on narrow, task-heavy enterprise workloads, and small fine-tuned models trained on high-quality expert data can beat them on both accuracy and cost.

Very cool.

39m ago