Pieter Levels highlights that AI voice generation systems, even those from leaders like ElevenLabs, fail to incorporate background noise or environmental reverb.
Audio quality in AI video models also lags behind their photorealistic visuals.
QUOTE POST
#1884 Ethan @TORCHCOMPILED
I’ve been wondering this and my best guess is either 1. Human perception of errors. Small errors in pixel intensities of images might go unnoticed whereas for audio it may be much more impactful 2. If diffusion is the dominant choice for modeling, it could be related to the typical spectra of audio vs images and some of the nuances around adding noise to slowly destroy signal
9:59 PM · May 19, 2026 · 346 Views
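The second guess in the quote post, about spectra and noise "slowly" destroying signal, can be made concrete with a rough sketch. It assumes the standard DDPM-style forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, with white Gaussian noise. The two spectra below (`steep` and `flat`) are hypothetical stand-ins chosen for illustration, not measurements of real images or audio, and the helper names are made up for this example.

```python
import numpy as np

def per_frequency_snr(signal_power, alpha_bar):
    """Expected signal-to-noise ratio in each frequency bin at noise level alpha_bar.

    Assumes x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, where eps is
    white Gaussian noise with expected power 1.0 in every bin.
    """
    return alpha_bar * signal_power / (1.0 - alpha_bar)

def destroyed_fraction(signal_power, alpha_bar):
    """Fraction of frequency bins where noise power exceeds signal power (SNR < 1)."""
    return float(np.mean(per_frequency_snr(signal_power, alpha_bar) < 1.0))

freqs = np.arange(1, 513, dtype=float)

# Hypothetical spectra (assumptions, not measured data):
steep = freqs ** -2.0          # power falls off quickly with frequency
flat = np.ones_like(freqs)     # power spread evenly across frequency

# Normalize both to the same average power per bin so they are comparable.
steep = steep / steep.sum() * len(freqs)
flat = flat / flat.sum() * len(freqs)

for alpha_bar in (0.99, 0.9, 0.6, 0.1):
    print(
        f"alpha_bar={alpha_bar}: "
        f"steep destroyed={destroyed_fraction(steep, alpha_bar):.2f}, "
        f"flat destroyed={destroyed_fraction(flat, alpha_bar):.2f}"
    )
```

Under these assumptions, the steep spectrum loses its high-frequency bins first and its low-frequency bins last, so the signal is destroyed gradually from fine to coarse. The flat spectrum has the same SNR in every bin, so it flips from fully intact to fully buried at a single noise level. Where real audio falls between these two extremes is exactly the open question the post raises.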