🎉 SGLang-Omni now serves MOSS-TTS-Local Transformer v1.5 from @Open_MOSS on day 0! This is an open 48 kHz stereo TTS model built on a Qwen3-4B backbone.
✅ Zero-shot voice cloning + native streaming at 48 kHz stereo
✅ 31 languages, trained on ~4M hours of speech
✅ Duration control + explicit pause markup + long-form up to 10 min
✅ 5.976 req/s non-streaming at RTF 0.644, 1.75% WER (SeedTTS English, 2× GPU)
✅ Three-stage pipeline: reference encoding → AR engine → streaming vocoder, with frame-level CUDA Graphs
Cookbook: https://sgl-project.github.io/sglang-omni/cookbook/moss_tts_local.html
Run it now with SGLang-Omni!
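For anyone reading the RTF number above: real-time factor is conventionally synthesis time divided by the duration of the audio produced, so anything below 1.0 is faster than real time. Here is a minimal sketch of that arithmetic. The function name and the 6.44 s / 10 s timings are hypothetical, picked only to reproduce 0.644; how the benchmark aggregates RTF across concurrent requests is not stated in the post.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by generated audio duration.

    Hypothetical helper for illustration; not part of SGLang-Omni.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Assumed timings: 6.44 s of compute for a 10 s clip.
rtf = real_time_factor(6.44, 10.0)
print(f"RTF = {rtf:.3f}")  # prints "RTF = 0.644"
print("faster than real time" if rtf < 1.0 else "slower than real time")
```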
🤗 MOSS-TTS-Local Transformer v1.5 is now open source.
Built with a pure autoregressive Audio Tokenizer + LLM paradigm:
>MOSS-Audio-Tokenizer-v2, 2B params
>Qwen3-4B backbone
>Native 48 kHz stereo audio
>Streaming output with theoretical sub-100 ms TTFT
>Zero-shot voice cloning
>Inline [pause] control
>🇺🇸 🇯🇵 🇰🇷 31 language synthesis
>SGLang-Omni Day0 support 🎉 @sgl_project @lmsysorg
Designed for voice agents, digital humans, game NPCs, audiobooks, and real-time speech generation.