AI · 7h ago

Boson AI releases Higgs Audio v3, a 4B-parameter chat-native TTS model that streams audio before sentences are fully generated

The model supports zero-shot voice cloning across 100 languages

Original post · Banghua Zhu #1147
LMSYS Org @lmsysorg

🎉 Meet Higgs Audio v3 TTS from @boson_ai, a ~4B chat-native TTS model for real-time voice agents. Day-0 support is live in SGLang-Omni!

> Low-latency streaming
> 100 languages, single-digit WER/CER
> Zero-shot voice cloning from a short clip
> 20+ inline tokens for emotion, style & SFX
> 14.74 req/s @ RTF 0.262 on 1× H100

👉 Cookbook: http://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html
Run it now with SGLang-Omni!

2:14 PM · Jun 4, 2026 · 2.3K Views
Posts from X
Most Activity
Views 1.8K · Bookmarks 17 · Likes 32 · Replies 4
LMSYS Org @lmsysorg

🎙️ New blog: Higgs Audio v3 TTS on SGLang-Omni: Real-Time, Controllable Speech for Voice Agents

Higgs (@bosonai) is a ~4B chat-native TTS model (Qwen3-4B backbone, interleaved text + audio tokens) built for streaming voice:
🎙️ Synthesis starts before a full sentence arrives
🌍 100 languages at single-digit WER/CER
🗣️ Zero-shot voice cloning from a short reference clip, across languages
🎛️ 20+ inline control tokens for emotion, style, prosody & sound effects

⚑ Serving with SGLang-Omni Runs as a multi-stage pipeline: preprocessing β†’ audio encoder β†’ TTS engine β†’ vocoder βœ… CUDA Graph decode βœ… Async one-step-lookahead βœ… Batched vocoder & encoder βœ… RadixAttention partitioned by reference audio to amortize repeated cloning

📊 Performance: 14.74 req/s at RTF 0.262 with 16 concurrent requests on a single H100 (bf16)
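To put those figures in context: real-time factor (RTF) is processing time divided by the duration of the audio produced, so values below 1 mean synthesis runs faster than playback. The sketch below works through the arithmetic under two assumptions that the post does not state: the benchmark keeps all 16 requests in flight continuously (so Little's law applies), and the RTF is the mean per-request latency divided by clip length. The derived numbers are illustrative, not reported results.

```python
# Figures quoted in the post.
CONCURRENCY = 16          # concurrent requests
THROUGHPUT_RPS = 14.74    # completed requests per second
RTF = 0.262               # real-time factor


def mean_latency_s(concurrency: float, throughput_rps: float) -> float:
    # Little's law: in-flight = throughput * latency, so latency = in-flight / throughput.
    # Assumes a closed loop that always holds `concurrency` requests in flight.
    return concurrency / throughput_rps


def implied_audio_s(latency_s: float, rtf: float) -> float:
    # RTF = latency / audio duration, so audio duration = latency / RTF.
    # Assumes the quoted RTF is a per-request average.
    return latency_s / rtf


latency = mean_latency_s(CONCURRENCY, THROUGHPUT_RPS)
audio = implied_audio_s(latency, RTF)
# Seconds of speech produced per wall-clock second across all requests.
aggregate = THROUGHPUT_RPS * audio

print(f"mean latency per request: {latency:.3f} s")
print(f"implied mean clip length: {audio:.2f} s")
print(f"audio seconds per wall-clock second: {aggregate:.1f}")
```

Under these assumptions each request takes a little over one second, the average clip is roughly four seconds long, and the GPU produces about 61 seconds of speech per second of wall-clock time (equivalently, concurrency divided by RTF).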

7h · Views 1.8K · Likes 32 · Bookmarks 17