So the line isn't "models shouldn't imagine." We want models that complete scenes, predict, simulate. The line is that the model needs to know, and tell us, which parts came from input and which parts it filled in.
None of this is specific to vision. A fabricated citation is the same move: a world model of the literature completing a gap with plausible authors and a plausible year. Ignoring retrieved context when priors are strong is another example.