Open-source contributor Grad says applying length penalties to software engineering tasks first prevents redundant reinforcement learning training · Digg

Posts from X

Views 6.4K · Bookmarks 36 · Likes 58 · Retweets 5 · Replies 3

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Xiaomi dedicates a separate paper to their MOPD method used for MiMo, which is similar to the industry standard as we've been seeing in DS, GLM etc. They frame it as a "capability integration paradigm". Mix-RL aka rawdogging is the 2nd best one. This addresses our post-V4 debate:

Rosinality@rosinality

https://arxiv.org/abs/2606.30406

OPD to combine multiple teachers. It is a baseline now. One detail could be whether token-level KL or top-K/full vocabulary distillation is better. (They found token-level KL works well enough.)

23h · 6.4K views · 58 likes · 36 bookmarks
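For readers unfamiliar with the options Rosinality lists, here is a minimal sketch of three per-position distillation signals. It is not taken from the paper: the function names are hypothetical, "token-level" is assumed to mean a log-ratio evaluated only on the token the student sampled, and the top-K variant is assumed to renormalise both distributions over the teacher's top-K tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sampled_token_signal(student_logits, teacher_logits, token):
    """Assumed 'token-level' signal: log p_teacher(token) - log p_student(token)
    on the single token the student sampled. Averaged over student samples,
    this equals minus the reverse KL(student || teacher)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return t[token] - s[token]

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Exact reverse KL(student || teacher) over the whole vocabulary."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def top_k_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL restricted to the teacher's k most likely tokens, with both
    distributions renormalised over that subset (an assumption of this sketch)."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)[:k]
    s = log_softmax([student_logits[i] for i in order])
    t = log_softmax([teacher_logits[i] for i in order])
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

if __name__ == "__main__":
    student = [2.0, 1.0, 0.1, -1.0]   # toy 4-token vocabulary
    teacher = [1.5, 1.2, 0.3, -2.0]
    print("sampled-token signal:", sampled_token_signal(student, teacher, 0))
    print("top-2 reverse KL:", top_k_reverse_kl(student, teacher, 2))
    print("full-vocab reverse KL:", full_vocab_reverse_kl(student, teacher))
```

The sampled-token variant needs only one teacher log-probability per position, which is why it is the cheapest of the three; the full-vocabulary variant needs the teacher's entire distribution.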

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Chinese labs are iterating on OPD quickly

Francesco Bertolotti@f14bertolotti

Quite a jump from vanilla OPD. Here the authors use the difference between a privileged teacher and privileged student to compute a token-level advantage. The advantage is used to switch from weak/strong distillation modes.

🔗https://arxiv.org/abs/2606.30626

23h99275

Grad@Grad62304977

i mean even in the mimo paper isn't this still bullish on mix-RL? The SWE teacher seems equivalent to the mix with MOPD then spending more compute for the same performance. For math and IF it looks like if u combine the compute on the teacher and MOPD u could extrapolate to similar results, doesn't seem far-fetched. Also we are trusting qwen3 30B ablations too much, some evidence being the enigmata bytedance paper and internal results on bigger models generalising when smaller models didn't.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Whale Camp is winning… hearts and minds at least

5h858123

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

see here

23h74212

Grad@Grad62304977

One thing I don’t quite get on the PPO vs GRPO stuff is that PPO is much better for long horizon but for shorter horizons it doesn’t matter much. Argument being the longer the horizon, the more sparse the signal, the bigger the group size u need, while PPO can give u better signal for equal or cheaper compute. But it’s not obv to me that the value model's job of giving the token level advantages is harder and harder the more long horizon u go.

2h18901
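To make the two kinds of advantage in Grad's post concrete, here is a minimal sketch under common textbook formulations; nothing here is specific to any lab's setup. GRPO is assumed to normalise one scalar reward per rollout within its group, and PPO is assumed to use GAE over per-token values from a learned value model. The function names and default hyperparameters are illustrative.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: one scalar per rollout, shared by all of its
    tokens. Uses the population std of the group (an assumption of this sketch)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalised advantage estimation for one rollout.
    rewards[t] is the reward after token t, values[t] is the value model's
    estimate before token t; the value after the last token is taken as 0."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

if __name__ == "__main__":
    # Sparse outcome reward: if every rollout in the group fails, GRPO has no signal.
    print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))
    print(grpo_advantages([0.0, 0.0, 0.0, 1.0]))
    # A value model yields a separate advantage for every token of a single rollout.
    print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.8]))
```

The first call shows the sparsity argument: with a group of identical rewards every advantage is zero, so longer horizons with rarer successes push the group size up. Whether a value model stays accurate at those horizons is the open question raised in the post.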

stochasm@stochasticchasm

@Grad62304977 he's back to this website

5h1695

Grad@Grad62304977

@ar0cket1 It’s from https://arxiv.org/abs/2606.30406

2h2721

Grad@Grad62304977

Bcs u need to count the extra MOPD compute. In SWE even without that mix RL is the same. For the others, if u added that MOPD compute on the mix RL I don’t think it’s far-fetched that it would be equal or better. Also these graphs defo have noise so would say the graph of mix RL could be much more similar to the expert in practice.

2h591

stochasm@stochasticchasm

@Grad62304977 mixRL here is just RLing on many envs?

5h974

Grad@Grad62304977

@stochasticchasm ye

5h573

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

They find that PG and top-k distillations are comparable so long as you use same-origin teachers, and top-k is cheaper. But if you try to get cute with a different model, even same family… PG struggles, top-k totally collapses. I think this is why V4 went with full-vocabulary.

22h5281

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@Grad62304977 I'd say most nontrivial economically valuable tasks are multi-domain, but if I were doing this and had infinite budget I'd even try multi-stage MOPD progressing from RL on atomic tasks to more integrated ones, each domain exquisitely tuned. A taxonomy is a good question.

4h811

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@Grad62304977 > Like do we fully think theres like 100 opus and codex experts

I think there might be like 10. They give other reasons against MixRL, as you see. It might depend on how many "domains" you're covering. Perhaps the optimum is MixRL for scale-up, MOPD for scale-out.

Grad@Grad62304977

For SWE u would argue mixRL was better here. Also who says mixRL can't be as parallelised? You can parallelise. Plus generalisation of bigger models. Like do we fully think there's like 100 opus and codex experts, and that the generalisation between these domains is nothing? Like we think training for AI research would be a different expert even tho it requires good SWE, search, math, science, .... And why is it up for us to decide if it can generalise, so much inductive bias.

4h3110

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@Grad62304977 notably not impressed by Minimax's result

> GLM-5

GLM-5.2 straightforwardly does MOPD.

GRPO reasoning was pretty watertight at the time and the direct task (mafs). maybe people got carried away

Grad@Grad62304977

@teortaxesTex We know minimax doesn't, not sure if we know for kimi. GLM-5 did something more like cascade RL. But i mean u could say "why did all the chinese labs move from PPO to GRPO"

3h2810

ar0cket1@ar0cket1

@Grad62304977 how is mixRL better in these figures, also what paper is this

2h271

Shuyao Xu@TimXu222575

I think MOPD is picking up, mostly because of organizational and management reasons.

In companies, you probably have different groups of people working on different domains, and they all want to train their expert models... If you do mix-RL, you will have to "All-reduce" many times.

4h131

Lei Li@_TobiasLee

Our MOPD from MiMo-V2-Flash has been widely adopted in modern post-training pipelines.

Now the paper is out with more details & comparison.

Check it out: https://arxiv.org/abs/2606.30406

6h715101

Grad@Grad62304977

@teortaxesTex We know minimax doesn't, not sure if we know for kimi. GLM-5 did something more like cascade RL. But i mean u could say "why did all the chinese labs move from PPO to GRPO"

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@Grad62304977 Why do you think every china lab does MOPD now? It's not like "it's China they have too many people", these groups are all pretty small. I suspect that this is derisking against hyperparameter failures or something

3h3400

Grad@Grad62304977

@teortaxesTex but why are u avoiding mix-RL? We've embraced it for every other stage of training, why not here too?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@Grad62304977 I'd say most nontrivial economically valuable tasks are multi-domain, but if I were doing this and had infinite budget I'd even try multi-stage MOPD progressing from RL on atomic tasks to more integrated ones, each domain exquisitely tuned. A taxonomy is a good question.

4h1700

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Might be of interest to @Grad62304977 @rawsh0 @stochasticchasm and others

22h4841