Video AI Models Hallucinate Sounds from Visual Cues Instead of Listening

Original post

turns out, today's video models are mostly just pretending to listen.

👁️🎙️ When Vision Speaks for Sound 👁️🎙️

We tested top models (including the new Gemini 3.5 Flash). They all suffer from the audio-visual Clever Hans effect: hallucinating sounds from visual cues instead of actually verifying audio. 🤯

🔇 Mute the audio? They still “hear” the crash.
⏱️ Shift the sound? They still say it is synced.
🔄 Swap the track? They often accept the mismatch.

🛠️ The fix? 10K pairs of targeted preference alignment. We boosted performance by 28 points, with zero drop in general video skills.

More details on how we diagnose and fix this here 👇
📄 Paper: https://arxiv.org/abs/2605.16403
🔗 Website: https://rakanwen.github.io/when-vision-speaks-for-sound/
🤗 Collection: https://huggingface.co/collections/Rakancorle1/when-vision-speaks-for-sound

#videoaudio #videoaudioalignment #omnimodel

2:08 PM · May 19, 2026
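
The three probes in the post (mute, shift, swap) can be sketched as simple edits to a clip's audio track while the video stays untouched. This is a minimal illustration, not the authors' pipeline: the `Audio` type, the function names, and the default 1-second offset are all assumptions made for the example, and a real setup would decode and re-mux actual audio rather than use plain lists of samples.

```python
from typing import Dict, List

# Assumption: a mono track as a list of float samples, standing in for decoded audio.
Audio = List[float]


def mute(audio: Audio) -> Audio:
    """Replace every sample with silence while keeping the duration."""
    return [0.0] * len(audio)


def shift(audio: Audio, offset_s: float, sample_rate: int) -> Audio:
    """Delay (offset_s > 0) or advance (offset_s < 0) the audio, padding with silence."""
    n = len(audio)
    k = int(round(offset_s * sample_rate))
    if k >= 0:
        # Prepend silence, then trim back to the original length.
        return ([0.0] * min(k, n) + audio)[:n]
    k = min(-k, n)
    # Drop the first k samples and pad the tail with silence.
    return audio[k:] + [0.0] * k


def swap(audio: Audio, other: Audio) -> Audio:
    """Substitute an unrelated track, trimmed or zero-padded to the original length."""
    n = len(audio)
    return (other + [0.0] * max(0, n - len(other)))[:n]


def make_probes(audio: Audio, other: Audio, sample_rate: int,
                offset_s: float = 1.0) -> Dict[str, Audio]:
    """Build the original track plus its three perturbed variants (hypothetical helper)."""
    return {
        "original": list(audio),
        "muted": mute(audio),
        "shifted": shift(audio, offset_s, sample_rate),
        "swapped": swap(audio, other),
    }
```

Each variant would be paired with the same unmodified video and the same question (for example, "Do you hear the crash?"). A model that actually listens should change its answer across the variants, while a model relying on visual cues would answer the same way every time.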

Do you hear the people sing? Frontier models clearly do not, but hallucinate that they do.

We found that, surprisingly, leading omni-modality foundation models are terrible at understanding the audio track of videos, and take the shortcut of assuming that the audio is always consistent with the video.

So, if there is a choir in the video, they assume the audio is singing; if there appears to be a crash, they hallucinate the impact noise; if you replace the soundtrack, they still think it is perfectly fine and consistent with what they see.

Learn more in our new preprint.
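
One way to quantify the shortcut described above is to count how often a model judges the audio "consistent" under each condition. The sketch below is illustrative only and is not the metric from the preprint: the condition labels and the `acceptance_rates` helper are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, Tuple


def acceptance_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of clips per condition that the model judged audio-visually consistent.

    `results` holds (condition, model_says_consistent) pairs; the condition names
    (e.g. "original", "muted", "swapped") are assumptions for illustration.
    """
    totals: Dict[str, int] = {}
    accepted: Dict[str, int] = {}
    for condition, says_consistent in results:
        totals[condition] = totals.get(condition, 0) + 1
        accepted[condition] = accepted.get(condition, 0) + int(says_consistent)
    return {c: accepted[c] / totals[c] for c in totals}


# Made-up judgments, purely to show the call shape.
example = [
    ("original", True), ("original", True),
    ("swapped", True), ("swapped", False),
    ("muted", True),
]
rates = acceptance_rates(example)
```

A model that verifies the audio should show a high rate on the original clips and a low rate on the muted or swapped ones. Similar rates across all conditions would suggest the answer is driven by the video alone.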

Xiaofei Wen (@Xiaofei_Wen_Mk)


9:08 PM · May 19, 2026 · 1.3K Views
12:24 AM · May 20, 2026 · 242 Views