
Velma Voice Model Analyzes Raw Audio For Toxicity Without Transcription




Santiago (@svpino)

I've built two voice pipelines for two different companies.

They both look like this:

Audio → STT → Clean transcript → NLP → Classify → Act

This works, but it has a problem I couldn't solve.

Every time I convert audio to text, I keep the words but throw away much of the meaning. Tone, hesitation, sarcasm, and stress are all gone. I have the text but miss its soul.
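To make the problem concrete, here is a minimal Python sketch of that pipeline. Every stage is a hypothetical stub (not a real STT, NLP, or moderation library), and the `sarcastic` field on `Audio` is an assumed stand-in for delivery cues like tone and stress. The point is structural: the `Transcript` type has no field for delivery, so two clips with the same words but different tone reach the classifier as identical inputs.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    # Hypothetical stand-in for a waveform: the words spoken plus one delivery cue.
    words: str
    sarcastic: bool


@dataclass
class Transcript:
    # Only the words survive speech-to-text; there is no field for tone or stress.
    text: str


def stt(audio: Audio) -> Transcript:
    # Stub STT: keeps the words and silently drops `sarcastic`.
    return Transcript(text=audio.words)


def clean(t: Transcript) -> Transcript:
    return Transcript(text=t.text.strip().lower())


def nlp(t: Transcript) -> list[str]:
    # Naive tokenization: drop commas and exclamation marks, split on whitespace.
    return t.text.replace(",", "").replace("!", "").split()


def classify(tokens: list[str]) -> str:
    # Toy keyword rule standing in for a real text classifier.
    return "toxic" if {"idiot", "trash"} & set(tokens) else "ok"


def act(label: str) -> str:
    return "flag" if label == "toxic" else "allow"


def run_pipeline(audio: Audio) -> str:
    # Audio -> STT -> clean transcript -> NLP -> classify -> act
    return act(classify(nlp(clean(stt(audio)))))


sincere = Audio(words="Great job, genius!", sarcastic=False)
mocking = Audio(words="Great job, genius!", sarcastic=True)
print(run_pipeline(sincere), run_pipeline(mocking))  # prints: allow allow
```

Both clips produce the same transcript and therefore the same decision; no downstream text model can recover a cue that the STT step never emitted.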

The folks at @modulate_ai reached out and showed me how to solve this.

Velma is the voice model that's been running inside Call of Duty and GTA Online to catch toxicity in real time.

Velma skips the transcript entirely and works directly on the raw audio, which lets it take into account the "invisible clues" that text-based models miss.

It can detect up to 150 of these invisible clues, which no other model picks up!
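As a rough illustration of a cue that exists only in the audio, the sketch below counts hesitation pauses directly from raw samples by thresholding per-frame RMS energy. This is a deliberately crude, hypothetical heuristic and not Velma's method, which the post does not describe; the frame length, silence threshold, and minimum pause length are assumed values, and `tone` is a synthetic stand-in for speech.

```python
import math


def frame_rms(samples: list[float], frame_len: int) -> list[float]:
    # RMS energy of each full, non-overlapping frame; a trailing partial frame is dropped.
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def count_pauses(
    samples: list[float],
    sample_rate: int,
    frame_ms: int = 20,               # assumed analysis frame length
    silence_threshold: float = 0.01,  # assumed RMS level treated as silence
    min_pause_ms: int = 200,          # assumed shortest gap that counts as hesitation
) -> int:
    """Count silent gaps of at least `min_pause_ms` that sit between stretches of speech."""
    frame_len = int(sample_rate * frame_ms / 1000)
    min_frames = math.ceil(min_pause_ms / frame_ms)
    pauses = 0
    silent_run = 0
    seen_speech = False
    for energy in frame_rms(samples, frame_len):
        if energy < silence_threshold:
            silent_run += 1
        else:
            # A long enough silence that follows earlier speech and ends in speech is a pause.
            if seen_speech and silent_run >= min_frames:
                pauses += 1
            seen_speech = True
            silent_run = 0
    return pauses


def tone(n: int, sample_rate: int = 1000) -> list[float]:
    # Synthetic 100 Hz signal standing in for voiced speech.
    return [0.5 * math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(n)]


hesitant = tone(500) + [0.0] * 300 + tone(500)  # 300 ms gap between phrases
fluent = tone(500) + [0.0] * 100 + tone(500)    # 100 ms gap between phrases
print(count_pauses(hesitant, 1000), count_pauses(fluent, 1000))  # prints: 1 0
```

A transcript of either clip would be identical, yet the waveform distinguishes them. A production system would need noise-robust voice-activity detection rather than a fixed energy threshold.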

You can access Velma through an API, and it's ~10x cheaper than pushing audio through an LLM.

If you want to give it a try, use this link to get 1,000 free credits:

http://modulate.ai/api/velma?utm_source=x&utm_medium=influencer&utm_campaign=velmaapi&utm_term=socialpost&utm_content=santiago

Thanks to the team for partnering with me on this post.

6:35 AM · Jun 5, 2026 · 10.9K Views