it would be funny if you could build some bullshit discrete autoencoder for SSL vision latents and train on that attached to vlm tool call outputs and magical modeling of "what will be the visual state" starts to fall out lmao
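for concreteness, here's a rough sketch of the "discrete" part: snapping each continuous patch latent to its nearest codebook entry so the visual state becomes a short token sequence you could append to a tool call output. everything here is hypothetical (the `quantize` / `tokenize_patches` names, the toy 2-d codebook); a real setup would learn the codebook and use actual SSL features, this only shows the lookup step.

```python
def quantize(latent, codebook):
    """Return the index of the codebook vector closest to `latent` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


def tokenize_patches(patch_latents, codebook):
    """Map a list of patch latents to a list of discrete token ids."""
    return [quantize(p, codebook) for p in patch_latents]


# toy codebook and toy "patch latents" (made-up numbers, not real SSL features)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

tokens = tokenize_patches(patches, codebook)
print(tokens)
```

the token ids are what you'd train on next to the tool call text; whether anything "falls out" of that is the open question.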
i'm slightly apprehensive about the implications for tool call results that are inherently multimodal in vlm setups/contexts without unified multimodality. i wonder if you can see limited gains via weird second-order info à la "predicting DINO latent magnitude deltas" or whatever
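"latent magnitude deltas" is ambiguous, so here's a tiny sketch of two possible readings as scalar prediction targets: the change in the latent's norm, and the norm of the change. the function names and the toy vectors are made up for illustration; nothing here touches an actual DINO model.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def magnitude_delta(before, after):
    """Reading 1: how much the latent's norm changed across the tool call."""
    return l2_norm(after) - l2_norm(before)


def delta_magnitude(before, after):
    """Reading 2: how far the latent moved, regardless of direction."""
    return l2_norm([a - b for a, b in zip(after, before)])


# toy latents before/after a tool call (made-up numbers)
before = [3.0, 4.0]
after = [4.0, 3.0]

print(magnitude_delta(before, after))  # norm is unchanged here
print(delta_magnitude(before, after))  # but the latent still moved
```

the two can disagree (same norm, different direction), so which one counts as the useful "second-order" signal is a design choice.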