Asymmetric clipping lets the model move more aggressively when reinforcing a correct-but-rare judgment, while staying conservative when walking back a bad one. +10.1% accuracy.
CISPO loss with asymmetric clipping, replacing standard importance sampling. In policy gradient RL, you reweight updates by the ratio of the new and old policy's probabilities for an action, then clip that ratio so a single update can't swing too far. Standard clipping is symmetric, with the same bound in both directions.
