@Grad62304977 Short: 32K or less. Few pages long Long: 64/128K and higher, multiple rounds of tool calls or code execution
@recurseparadox What would u say is long horizon and short horizon?
Google DeepMind researcher Pranav Shyam drew a line in an X thread between simpler AI interactions capped at 32K tokens and the more demanding setups that kick in at 64K or 128K, where models must juggle repeated tool calls and code execution to keep state updated across steps.
Short windows stay close to single-turn or bandit-style problems, while longer ones introduce sub-chains and shared knowledge that can benefit from value functions or Monte Carlo estimates.
Whether value models add meaningful signal or just latency in these extended settings is still being worked through, with some long-horizon cases showing zero learning signal under current approaches.

@recurseparadox but what's the intuition here that tool calls and environments have a big effect? As in, is it a clear point where the value model can make a more accurate prediction of the final expected reward? Why is that different from, say, a long reasoning chain with a clear step made?

@recurseparadox hmm ya thats a fair point although i still feel say in math, there could be a natural segmentation of insights and steps the model took that are naturally shared across rollouts (maybe not an exact match but still)

Tool calls or code exec update the state of the MDP. If you have many states then maybe you know how to value at least some of them. This is where the sharing of knowledge happens between trajectories. The MDP states act like anchor points - they reappear in many trajectories and therefore you can use the value function of one for the other. The value estimation of the full trajectory can still be very wrong but at least the model gets some reward (for example, maybe the code compiled but all the tests failed; here the value function can reward successful compilation because it has seen that in other successful trajectories).
If the state of the MDP is not changing then the problem is a bandit problem, and there’s nothing to share between trajectories anymore. Initial prompt is the only anchor point. In that case the policy knows as much as the value function.
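
The anchor-point argument above can be sketched as a Monte Carlo estimate keyed on environment states. This is a minimal illustration, not anything taken from the thread: the state labels (`prompt`, `compiled`, `tests_pass`, and so on), the rollouts, and the helper `mc_state_values` are all hypothetical.

```python
from collections import defaultdict

def mc_state_values(trajectories):
    """Average the final reward over every trajectory that visits a state.

    Each trajectory is (states, final_reward); states are hashable labels
    for environment states (hypothetical, for illustration only).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, reward in trajectories:
        for s in set(states):  # count each state once per trajectory
            totals[s] += reward
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Tool-use setting: states recur across rollouts and act as anchor points.
tool_rollouts = [
    (["prompt", "compiled", "tests_pass"], 1.0),
    (["prompt", "compiled", "tests_fail"], 0.0),
    (["prompt", "compile_error"], 0.0),
    (["prompt", "compiled", "tests_pass"], 1.0),
]
tool_values = mc_state_values(tool_rollouts)

# Bandit setting: the state never changes, so the prompt is the only anchor.
bandit_rollouts = [(["prompt"], r) for _, r in tool_rollouts]
bandit_values = mc_state_values(bandit_rollouts)
```

In the tool-use rollouts, `compiled` ends up with a higher estimate than `prompt`, so a rollout that compiles but fails its tests is still distinguished from one that never compiles. In the bandit case the only estimate is the prompt's average reward, which says nothing about individual steps.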

The moment there are a few tool calls / an environment returning state, I think value models take over. They can smear reward in between the subgoals. I think value functions are most useful locally, but for long-chain tasks.
Like you’re right in that value functions over very long horizons shouldn’t be magically reliable. But I think their main benefit is in subtask rewards.
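
One way to picture "smearing" a sparse final reward across subgoals is to credit each step with the change in estimated value it produces. The sketch below uses undiscounted one-step value differences, which is an assumption on my part rather than a method named in the thread; the value table, the state labels, and `step_credits` are hypothetical.

```python
def step_credits(states, final_reward, values):
    """Split a sparse final reward into per-step credit using value differences.

    Credit for step t is V(s_{t+1}) - V(s_t); the last state is credited
    final_reward - V(s_last). The credits sum to final_reward - V(s_0).
    """
    credits = []
    for t, s in enumerate(states):
        if t + 1 < len(states):
            target = values[states[t + 1]]
        else:
            target = final_reward
        credits.append(target - values[s])
    return credits

# Hypothetical value estimates for three environment states.
values = {"prompt": 0.5, "compiled": 0.75, "tests_fail": 0.0}

# A rollout that compiled but failed its tests: final reward 0.
credits = step_credits(["prompt", "compiled", "tests_fail"], 0.0, values)
```

Here the step that reaches `compiled` receives positive credit even though the rollout's final reward is zero, while the step that leads to failing tests receives negative credit. How trustworthy those credits are depends entirely on how good the value estimates are.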
@recurseparadox Ok ya fair. Wdyt abt the value models reliability at these different horizons