Video AI Models Hallucinate Sounds from Visual Cues Instead of Listening
Do you hear the people sing? Frontier models clearly do not, but hallucinate that they do.
We found that, surprisingly, leading omni-modality foundation models are terrible at understanding the audio track of videos and take the shortcut of assuming that the audio is always consistent with the video.
So if there is a choir in the video, they assume the audio is singing; if there appears to be a crash, they hallucinate the impact noise; and if you replace the soundtrack, they still think it is perfectly fine and consistent with what they see.
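For anyone who wants to poke at this themselves, here is a minimal sketch of three audio perturbations of this kind: silencing the track, delaying it, and replacing it with an unrelated one. It is only an illustration, not the paper's pipeline. The function names (`mute`, `shift`, `swap`) and the representation of audio as a plain list of float samples are assumptions made for this example, and `swap` assumes the replacement track is non-empty.

```python
import math


def mute(audio):
    """Return a silent track with the same number of samples."""
    return [0.0] * len(audio)


def shift(audio, sample_rate, delay_s):
    """Delay the track by delay_s seconds, padding the start with silence.

    The output keeps the original length, so the tail of the track is cut off.
    """
    offset = int(round(delay_s * sample_rate))
    if offset <= 0:
        return list(audio)
    offset = min(offset, len(audio))
    return [0.0] * offset + list(audio[: len(audio) - offset])


def swap(audio, other):
    """Replace the track with an unrelated one, looped or trimmed to the same length.

    Assumes `other` contains at least one sample.
    """
    repeats = -(-len(audio) // len(other))  # ceiling division
    return (list(other) * repeats)[: len(audio)]


# Hypothetical one-second clips at 16 kHz: a 440 Hz tone and a 220 Hz tone.
sample_rate = 16_000
original = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
unrelated = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate // 2)]

silent = mute(original)
delayed = shift(original, sample_rate, delay_s=0.5)
replaced = swap(original, unrelated)
```

Each perturbed track would then be paired with the untouched video frames. A model that really listens should change its answer about what it hears, while one relying on the visual shortcut will not.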
Learn more in our new preprint.
Turns out, today's video models are mostly just pretending to listen.

👁️🎙️ When Vision Speaks for Sound 👁️🎙️

We tested top models (including the new Gemini 3.5 Flash). They all suffer from the audio-visual Clever Hans effect: hallucinating sounds from visual cues instead of actually verifying the audio. 🤯

🔇 Mute the audio? They still “hear” the crash.
⏱️ Shift the sound? They still say it is synced.
🔄 Swap the track? They often accept the mismatch.

🛠️ The fix? Targeted preference alignment on 10K pairs. We boosted performance by 28 points, with zero drop in general video skills.

More details on how we diagnose and fix this here 👇
📄 Paper: https://arxiv.org/abs/2605.16403
🔗 Website: https://rakanwen.github.io/when-vision-speaks-for-sound/
🤗 Collection: https://huggingface.co/collections/Rakancorle1/when-vision-speaks-for-sound

#videoaudio #videoaudioalignment #omnimodel