Researcher Floats Discrete Autoencoder For VLM Tool Call Outputs

Original post
kalomaze@kalomaze

it would be funny if you could build some bullshit discrete autoencoder for SSL vision latents and train on that attached to vlm tool call outputs and magical modeling of "what will be the visual state" starts to fall out lmao
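For readers unfamiliar with the "discrete" part of a discrete autoencoder, here is a minimal sketch of the quantization step only, assuming a nearest-neighbor codebook lookup in the style of vector quantization. The post does not specify any method, so the codebook values, the 2-D toy latents, and the function names `quantize` and `tokenize_patches` are all hypothetical; a real setup would learn the codebook and use latents from an actual self-supervised vision encoder.

```python
import math


def quantize(latent, codebook):
    """Return the index of the codebook vector closest to `latent` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(latent, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize_patches(patch_latents, codebook):
    """Map each continuous patch latent to a discrete token id."""
    return [quantize(latent, codebook) for latent in patch_latents]


# Hypothetical toy data: a 4-entry codebook and three 2-D "patch latents".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

tokens = tokenize_patches(patch_latents, codebook)
print(tokens)  # [0, 3, 2]
```

The resulting token ids are the kind of discrete sequence that could, in principle, be appended to a tool call result for a language model to predict.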

kalomaze@kalomaze

i'm slightly apprehensive about the implications for toolcall results that are inherently multimodal in a vlm setup/contexts without unified multimodality i wonder if you can see limited gains via weird second order info ala "predicting DINO latent magnitude deltas" or whatever
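One possible reading of "latent magnitude deltas" is the per-patch change in latent vector norm between the visual state before and after a tool call. The sketch below assumes that interpretation; the post does not define the quantity, and the toy 2-D vectors stand in for real encoder outputs. The names `l2_norm` and `magnitude_deltas` are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean length of a latent vector."""
    return math.sqrt(sum(x * x for x in vec))


def magnitude_deltas(before, after):
    """Per-patch change in latent norm between two visual states."""
    if len(before) != len(after):
        raise ValueError("both states must have the same number of patches")
    return [l2_norm(a) - l2_norm(b) for b, a in zip(before, after)]


# Hypothetical patch latents before and after a tool call changes the image.
before = [[3.0, 4.0], [0.0, 1.0]]
after = [[6.0, 8.0], [0.0, 1.0]]

deltas = magnitude_deltas(before, after)
print(deltas)  # [5.0, 0.0]
```

A scalar per patch is a far weaker target than the full latent, which fits the post's framing of it as "second order info" that might yield only limited gains.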

2:09 PM · Jun 11, 2026 · 480 Views
kalomaze@kalomaze

i am mainly thinking in the context of things like, say, CAD feedback loops where you want the agent to take some reference image and reconstruct it as openscad format, where there's an inherent trial and error thing going on
