We found that state-of-the-art VLMs (Gemini, GPT-5, etc.) fail at predicting task progress for online RL, so we built our own: SOLE-R1.
SOLE-R1 is trained on 10 million images and video frames and 4 million chain-of-thought traces that reason over both space and time.
The result is a video-language reasoning model that can serve as the only reward signal for online RL, with no other rewards needed!
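
To make the idea concrete, here is a minimal sketch of how a task-progress predictor could act as the only reward in an online RL loop. It is an illustration under assumptions, not SOLE-R1's actual interface: `predict_progress`, `progress_rewards`, and `toy_progress` are hypothetical names, and defining the per-step reward as the change in predicted progress is one plausible choice that the post itself does not specify.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # placeholder type standing in for an image or video frame


def progress_rewards(
    frames: List[Frame],
    predict_progress: Callable[[List[Frame]], float],
) -> List[float]:
    """Turn progress estimates into per-step rewards.

    Assumption: `predict_progress` maps the frames seen so far to a task
    progress estimate in [0, 1]. The reward at each step is the increase
    in that estimate since the previous step.
    """
    rewards: List[float] = []
    previous = 0.0
    for t in range(1, len(frames) + 1):
        # Clamp to [0, 1] in case the predictor drifts out of range.
        current = min(1.0, max(0.0, predict_progress(frames[:t])))
        rewards.append(current - previous)
        previous = current
    return rewards


def toy_progress(history: List[Frame]) -> float:
    """Hypothetical stand-in for the model: progress grows linearly over 4 frames."""
    return len(history) / 4


episode = [[0.0]] * 4  # four dummy frames
rewards = progress_rewards(episode, toy_progress)
```

With this formulation the rewards over an episode sum to the final progress estimate, so a policy is rewarded only for moving the task forward, and no environment-provided reward is required.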