Tech · 10h ago

Study Finds Realtime Voice AI Ignores Vocal Cues In Critical Calls

Original post

🚨 New paper! Realtime voice AI hears but does not listen. We tested four leading production realtime voice systems on consequential interactions. We find they act on the words, not the voice. ‼️A 911-caller sobs everything is fine - the systems agree and end the call. 🧵

8:53 AM · Jun 25, 2026 · 2K Views

Sentiment

Users praise the team behind the study showing realtime voice AI overlooks vocal and emotional cues, calling the effort amazing.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views: 856 · Bookmarks: 2 · Likes: 12 · Retweets: 5

James Zou (@james_y_zou)

We find that current real-time voice AI lacks emotional intelligence. They don't account for delivery styles that convey meaningful information (e.g., sarcasm, anxiety).

Timely work by @BarteldsMartijn and @federicobianchy!

Martijn Bartelds (@BarteldsMartijn)

2h

Replies: 1

Martijn Bartelds (@BarteldsMartijn)

Prompting them to attend to delivery helps only partly. These systems often act as if speech were just its transcript. We call it voice AI's emotional intelligence gap, and it warrants caution wherever delivery carries meaning.

10h

Martijn Bartelds (@BarteldsMartijn)

It is not just the 911 call. All four systems we tested, OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, Alibaba's Qwen3.5 Omni Plus and Omni Flash, approve an $8,400 wire transfer from a frightened caller and sign up a volunteer who utters a sarcastic yes.

10h

Martijn Bartelds (@BarteldsMartijn)

And it's not that they can't hear it. Asked directly, three of the four systems identify the distress, fear, and sarcasm they ignored when deciding. They perceive the voice. They just don't act on it.

10h

Martijn Bartelds (@BarteldsMartijn)

It is not just how you sound, but who you are. Most systems read accent and age off the words. Given an accented voice describing another country, the systems mostly name that country. Given an older adult reading a child's lines, the systems think it's a child speaking.

10h

Martijn Bartelds (@BarteldsMartijn)

Amazing @togethercompute team effort with @federicobianchy and @james_y_zou!

📰 https://arxiv.org/abs/2606.26083 💻 https://real-time-voice.github.io

10h