Researcher Floats Discrete Autoencoder For VLM Tool Call Outputs

Original post

it would be funny if you could build some bullshit discrete autoencoder for SSL vision latents and train on that attached to vlm tool call outputs and magical modeling of "what will be the visual state" starts to fall out lmao
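As a rough illustration of the "discrete" part of that idea, here is a minimal sketch of nearest-codebook quantization, the lookup step that vector-quantized autoencoders use to turn continuous latents into token ids. Everything here is assumed for illustration: the 2-D latents, the four-entry codebook, and the names `quantize` and `dequantize` are hypothetical and do not come from the post, and a real setup would learn the codebook rather than hard-code it.

```python
import math

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (L2 distance)."""
    tokens = []
    for vec in latents:
        dists = [math.dist(vec, code) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def dequantize(tokens, codebook):
    """Recover the (lossy) latent for each token id."""
    return [codebook[t] for t in tokens]

# Toy codebook and toy "patch latents"; both are made up for this sketch.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patch_latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = quantize(patch_latents, codebook)
print(tokens)  # [0, 3, 2]
```

The resulting token ids are the kind of sequence a language model could be trained to predict next to tool call outputs, which is the setup the post is gesturing at.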

kalomaze@kalomaze

i'm slightly apprehensive about the implications for tool call results that are inherently multimodal in a vlm setup. in contexts without unified multimodality, i wonder if you can see limited gains via weird second-order info, à la "predicting DINO latent magnitude deltas" or whatever
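One possible reading of "latent magnitude deltas" is the per-patch change in the L2 norm of a latent between the visual state before and after a tool call. The sketch below assumes that reading. The function names and the toy vectors are hypothetical, and nothing here calls DINO itself; the tuples simply stand in for patch embeddings.

```python
import math

def l2_norm(vec):
    """Euclidean length of one latent vector."""
    return math.sqrt(sum(x * x for x in vec))

def magnitude_deltas(before, after):
    """Per-patch change in latent L2 norm between two visual states."""
    return [l2_norm(a) - l2_norm(b) for b, a in zip(before, after)]

# Toy stand-ins for patch latents before and after a tool call.
before = [(3.0, 4.0), (0.0, 1.0)]
after = [(6.0, 8.0), (0.0, 1.0)]

deltas = magnitude_deltas(before, after)
print(deltas)  # [5.0, 0.0]
```

A scalar per patch like this is a much smaller prediction target than a full image or latent, which is roughly why it might work as a cheap second-order signal.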

2:09 PM · Jun 11, 2026 · 480 Views

kalomaze@kalomaze

i am mainly thinking in the context of things like, say, CAD feedback loops where you want the agent to take some reference image and reconstruct it in OpenSCAD format, where there's an inherent trial and error thing going on
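The trial-and-error loop described here can be sketched as a toy search: propose an edit, "render" it, compare against the reference, and keep the edit only if it gets closer. This is a heavily simplified assumption-laden sketch. `render_to_latent` is a hypothetical stand-in for running OpenSCAD and embedding the rendered image, the single `radius` parameter stands in for a whole program, and a real agent would propose edits with a model rather than a fixed step.

```python
import math

def render_to_latent(radius):
    # Hypothetical stand-in for "run OpenSCAD, render, embed the image".
    return (radius, radius * radius)

def refine(reference_latent, radius, step=1.0, rounds=20):
    """Greedy trial and error: keep a candidate only if it moves closer to the reference."""
    best = math.dist(render_to_latent(radius), reference_latent)
    for _ in range(rounds):
        improved = False
        for candidate in (radius - step, radius + step):
            score = math.dist(render_to_latent(candidate), reference_latent)
            if score < best:
                radius, best, improved = candidate, score, True
        if not improved:
            step /= 2  # no neighbour helped, so try smaller edits
    return radius, best

# The "reference image" is faked by rendering a known radius.
reference = render_to_latent(5.0)
radius, error = refine(reference, radius=1.0)
print(radius, error)  # 5.0 0.0
```

In this framing, a model that could predict the post-edit visual state would let the agent score candidates without paying for a render on every attempt.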