🚨 New paper! Realtime voice AI hears but does not listen. We tested four leading production realtime voice systems on consequential interactions. We find they act on the words, not the voice. ‼️A 911 caller sobs "everything is fine"; the systems agree and end the call. 🧵
Users praise the team behind the study showing realtime voice AI overlooks vocal and emotional cues, calling the effort amazing.
We find that current real-time voice AI systems lack emotional intelligence. They don't account for delivery styles that convey meaningful information (e.g., sarcasm, anxiety).
Timely work by @BarteldsMartijn and @federicobianchy!
Prompting them to attend to delivery helps only partly. These systems often act as if speech were just its transcript. We call it voice AI's emotional intelligence gap, and it warrants caution wherever delivery carries meaning.

It is not just the 911 call. All four systems we tested (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) approve an $8,400 wire transfer from a frightened caller and sign up a volunteer who utters a sarcastic "yes".

And it's not that they can't hear it. Asked directly, three of the four systems identify the distress, fear, and sarcasm they ignored when deciding. They perceive the voice. They just don't act on it.

It is not just how you sound, but who you are. Most systems read accent and age off the words. Given an accented voice describing another country, the systems mostly name that country. Given an older adult reading a child's lines, the systems think it's a child speaking.

Amazing @togethercompute team effort with @federicobianchy and @james_y_zou!
📰 https://arxiv.org/abs/2606.26083 💻 https://real-time-voice.github.io