@krishnanrohit because it changes the semantics from the model's POV, and there's a known eval phenomenon: text prompts rendered as images for a VLM cause overall performance/reasoning degradation compared to the same prompts given as plain text
I don't understand this. If it is indeed that much cheaper and not just a mispricing (the DS paper says it isn't), why aren't all the labs just doing this anyway in the backend to increase margins and cut prices?