Tech · 2h ago

AI engineer Yupo Niu argues RLVR-trained models derive their generalization from diverse training environments rather than inherent capabilities

Miles Brundage counters that skills like self-correction generalize inherently.

Original post

1a3orn (@1a3orn)

RLVR-trained LLMs probably don't generalize "broadly" -- their broad intelligence comes from being trained on a huge diversity of RL envs.

However, Ant / OAI owning a huge diversity of RL envs will make it easier for them to study what algos *do* generalize broadly.

4:37 PM · Jun 8, 2026 · 608 Views

Sentiment

Commenters agree that RLVR can yield weak out-of-domain generalization, even on small models.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

1a3orn (@1a3orn)

That is, having 10k such environments sets you up well to check what algos give you the best transfer from one 5k to the other 5k.

This is a reason that the fake generality of RLVR-trained LLMs doesn't rule out short timelines entirely; there's a RL env data usefulness overhang.
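The split-and-compare check described in the post could be sketched roughly as below. This is only an illustration under stated assumptions, not anything from the thread: `split_envs`, `rank_by_transfer`, and the train-and-evaluate callables are hypothetical names, and each callable stands in for a full RL training run followed by evaluation on held-out environments.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: train on the first list of environment IDs,
# then return a mean score on the second (held-out) list.
TrainEval = Callable[[List[str], List[str]], float]


def split_envs(env_ids: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle environment IDs and split them into two disjoint halves."""
    ids = list(env_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]


def rank_by_transfer(
    algos: Dict[str, TrainEval],
    env_ids: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Train each algorithm on one half and rank by held-out score on the other."""
    train_half, held_out_half = split_envs(env_ids, seed)
    scores = {name: fn(train_half, held_out_half) for name, fn in algos.items()}
    # Highest held-out score first, i.e. best cross-environment transfer first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for real training runs (assumed numbers, not results).
    toy_algos: Dict[str, TrainEval] = {
        "algo_a": lambda train, held_out: 0.2,
        "algo_b": lambda train, held_out: 0.5,
    }
    envs = [f"env_{i}" for i in range(10)]
    for name, score in rank_by_transfer(toy_algos, envs):
        print(name, score)
```

With 10k environments the same split would give the 5k/5k comparison the post mentions; the held-out score is one possible proxy for transfer, not an established measure.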

Miles Brundage (@Miles_Brundage)

@1a3orn I think there's some of both going on. Things like self-correction, task decomposition, brainstorming etc. are generalizable skills that do apply to many tasks + don't require RL to be done on those specific tasks. But, they can certainly be refined much better per-task

1a3orn (@1a3orn)

@Miles_Brundage Yeah I agree tbc; you can use RLVR to get (pretty weak) "out of domain" generalization even on very small models.

It's just that you can probably grow this cross-task several OOMs (or, OOMs of a hypothetical good measure for cross task transfer...)

https://arxiv.org/pdf/2509.25123
