ok so minithesis smuggling in learned geometric biases into the optimization of RL tasks might be useful even if it means the network isn't literally optimizing the objective as we have defined or intended it in a pure pg sense, if only bc it constrains adaptation to a geometrically coherent space consider: a black box RLVR verifier that is systematically and deterministically wrong, in a way that is too arbitrary to learn without compressing an intractably large dictionary into the weights discriminative value estimators would be too "dumb" to compress that rule; instead, one would assume that they'd learn a smeared general-ish way of estimating what the verifier asks for
@teortaxesTex *sigh* value estimation smell real?