
Learned Geometric Biases Enhance RL Task Optimization Despite Objective Deviations


Original post

ok so minithesis: smuggling in learned geometric biases into the optimization of RL tasks might be useful even if it means the network isn't literally optimizing the objective as we have defined or intended it in a pure pg sense, if only bc it constrains adaptation to a geometrically coherent space

consider: a black box RLVR verifier that is systematically and deterministically wrong, in a way that is too arbitrary to learn without compressing an intractably large dictionary into the weights

discriminative value estimators would be too "dumb" to compress that rule; instead, one would assume that they'd learn a smeared general-ish way of estimating what the verifier asks for
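A toy sketch of the "systematically wrong verifier" idea, not anything from the thread itself. Everything here is an assumption made for illustration: the parity-based `true_reward`, the hash-based `flips` rule (roughly 10% of inputs), and the parity-only estimator standing in for a low-capacity value estimator. The point is only that an estimator too coarse to memorize the flips ends up with a smeared average of what the verifier returns.

```python
import hashlib

def true_reward(x: int) -> float:
    # Hypothetical "intended" objective: reward even inputs.
    return 1.0 if x % 2 == 0 else 0.0

def flips(x: int) -> bool:
    # Deterministic but arbitrary: a hash picks roughly 10% of inputs to flip.
    return hashlib.sha256(str(x).encode()).digest()[0] < 26

def verifier(x: int) -> float:
    # Black-box verifier: the intended reward, inverted on the flipped inputs.
    r = true_reward(x)
    return 1.0 - r if flips(x) else r

def fit_coarse_estimator(xs):
    # A "dumb" estimator that only sees parity, so it cannot memorize
    # which individual inputs were flipped. It stores one mean per bucket.
    totals, counts = [0.0, 0.0], [0, 0]
    for x in xs:
        bucket = x % 2
        totals[bucket] += verifier(x)
        counts[bucket] += 1
    return [t / c for t, c in zip(totals, counts)]

# estimate[0] is the mean verifier reward for even inputs, estimate[1] for odd.
estimate = fit_coarse_estimator(range(2000))
```

Memorizing the flips exactly would need a per-input lookup (the "intractably large dictionary"); the coarse estimator instead lands near the flip-adjusted average for each bucket.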

kalomaze@kalomaze

@teortaxesTex *sigh* value estimation smell real?

5:30 PM · Jun 19, 2026 · 1.1K Views

Sentiment

Negative users in the replies hate evaluating learned geometric biases in RL optimization through the lens of sample efficiency, calling the approach prejudiced.

Positive: 0.0% · Negative: 100.0% (1 comment with sentiment)


Posts from X

Most Activity

Views: 363 · Likes: 9 · Replies: 1

kalomaze@kalomaze

an example of this is a task that takes paragraphs, sorts sentences into some ~random order, asks the lm to reconstruct a valid one

there's vastly many more wrong orders than correct orders, so pinning to the original order is sane - but the original order isn't uniquely determined by the input in a lot of cases

the value estimator, in principle, learns a way to assign relative points that hedge as best as is possible given the incompressible constraints. that's probably the true value (heh) in explicit estimation; not some vague """variance reduction""" for """sample efficiency""" in the sense that the canon literature likes to invoke (else robotics people could just crank GRPO group size), but the smooth manifold geometry with inherently relative characteristics
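A minimal sketch of the sentence-reordering task described above, assuming distinct sentences and an exact-match verifier pinned to the original order. The names `make_task`, `verify`, and `wrong_orderings` are hypothetical, and `pairwise_agreement` is a hand-written stand-in for the kind of graded, relative signal a value estimator might approximate. It is not a learned estimator and not something the thread specifies.

```python
import math
import random
from itertools import combinations

def make_task(sentences, seed=0):
    # Shuffle the sentences into some ~random order (seeded for repeatability).
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def verify(candidate, original):
    # Strict verifier: full reward only for the exact original order.
    return 1.0 if list(candidate) == list(original) else 0.0

def wrong_orderings(n):
    # With n distinct sentences there are n! orders and one pinned target.
    return math.factorial(n) - 1

def pairwise_agreement(candidate, original):
    # Fraction of sentence pairs kept in the same relative order as the
    # original. Assumes every sentence is distinct and appears in both lists.
    pos = {s: i for i, s in enumerate(original)}
    pairs = list(combinations(candidate, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if pos[a] < pos[b])
    return agree / len(pairs)
```

The strict verifier gives the same zero to a near-miss and to a fully reversed paragraph, while the pairwise score separates them. That gap is the "relative points" idea in miniature.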


kalomaze@kalomaze

in general i just (sorry @hallerite) hate sample efficiency as a lens for this type of thing, to an almost prejudiced degree, because you're not getting any more or less "true bits" from learned value estimators

the only reason why a learned value estimator can do anything useful for you is because it learns structure in the geometric sense, much like a bottleneck representation

the structure compounding nicely from a value head is, in some sense, equally as "wasteful" as GRPO informationally speaking, but forces it through a funnel that has invariant geometric characteristics

i struggle to call that something that's more "sample efficient" in any meaningful sense, because it's not really a property of the reward function being """Learned Better""", but rather, the reward function being learned **to the extent that there is implied latent structure which is actually learnable** in the reward function
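For reference, a rough sketch of the two baselines being contrasted, under stated assumptions: the GRPO-style advantage is written as reward minus group mean divided by group standard deviation (population std plus a small `eps` here; real implementations differ in details), and `value_estimate` is a hypothetical scalar that a learned value head would output for the prompt. Both functions consume the same sampled rewards and differ only in where the baseline comes from.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    # Group-relative baseline: normalize each reward against the other
    # samples drawn for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def value_baseline_advantages(rewards, value_estimate):
    # Learned baseline: subtract a value head's prediction, which is
    # shared structure fit across prompts, not just this group.
    return [r - value_estimate for r in rewards]
```

Neither function sees more reward samples than the other, which matches the "no extra true bits" framing: any difference has to come from the structure the value head has learned.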
