Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes!🥳
Last year, we released IcePop to stabilize MoE RL with double-sided masking. As we dug deeper, we noticed something unexpected: the masking ratio went down while the training–inference mismatch kept growing!😞
This year, we introduce 𝑲𝑷𝒐𝒑🪩, which replaces the fixed ratio constraint with a binary KL divergence criterion that adaptively masks mismatched tokens! The masking ratio now tracks fluctuations in the training–inference gap during training, keeping policy optimization stable and effective on long-horizon agentic RL rollouts.
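To make the idea concrete, here is a minimal sketch of what masking by binary KL could look like. It is only an illustration under assumptions of ours, not the actual 𝑲𝑷𝒐𝒑 implementation: the function names `binary_kl` and `kl_mask`, the direction of the KL (training vs. inference), the per-token Bernoulli view of the sampled token's probability, and the threshold value are all hypothetical. See the blog for the real formulation.

```python
import math


def binary_kl(p: float, q: float, eps: float = 1e-8) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    # Clip to avoid log(0) and division by zero.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_mask(train_probs, infer_probs, threshold):
    """Return 1.0 for tokens that are kept and 0.0 for tokens that are masked.

    train_probs / infer_probs: probability of each sampled token under the
    training engine and the inference engine (assumed inputs).
    threshold: the single hyperparameter of this sketch (hypothetical value).
    """
    return [
        1.0 if binary_kl(p_train, p_infer) <= threshold else 0.0
        for p_train, p_infer in zip(train_probs, infer_probs)
    ]


if __name__ == "__main__":
    # Both pairs have the same probability ratio (2x), but the binary KL
    # differs a lot, so a KL threshold treats them differently.
    train = [0.02, 0.8]
    infer = [0.01, 0.4]
    for p, q in zip(train, infer):
        print(f"ratio={p / q:.1f}  binary_kl={binary_kl(p, q):.4f}")
    print(kl_mask(train, infer, threshold=0.05))
```

In this sketch, two tokens with the same 2x probability ratio get very different KL values, so a single KL threshold can keep one and mask the other, which a fixed ratio bound cannot do.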
This simple change enables our Ring-2.6-1T to score over 76 on SWE-bench Verified with pure RL training!
No modifications to infrastructure. No routing replay. Just one parameter. Power your agentic RL with 𝑲𝑷𝒐𝒑!
Click through for the details!
📜Blog: https://ringtech.notion.site/kpop