Developer Builds 52M-Parameter Decoder-Only TTS Model Predicting Mel Spectrograms

Original post
ZD1908 @ZDi____

It works! I turned it into a decoder-only TTS model: text first, then mel frames predicted by the head. This one has just 52M parameters and was trained from scratch on LJSpeech (20 hours). The audio quality is shitty because I'm inverting the mel spectrogram with Griffin-Lim.

ZD1908 @ZDi____

I'm running an experiment. With AR transformers for speech, do we need a tokenizer, or can we get away with predicting the mel spectrogram directly? This unconditional transformer predicts a latent, which then goes into a 1D causal conv that predicts the next 4 mel frames.

9:52 PM · Jun 10, 2026 · 16.2K Views
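
The head described in the post can be sketched in plain Python. This is only an illustration under stated assumptions, not the author's code: the post says only that the conv is causal and predicts the next 4 mel frames, so the kernel size, the dimensions, and the names `make_weights` and `causal_conv_head` are hypothetical. A real model would use a tensor library rather than nested lists.

```python
import random

FRAMES_PER_STEP = 4  # from the post: the head predicts the next 4 mel frames
N_MELS = 8           # assumption: tiny value for the demo
LATENT_DIM = 6       # assumption: size of each transformer latent
KERNEL = 3           # assumption: the post does not give the kernel size


def make_weights(seed=0):
    """Random weights indexed as weights[tap][out_channel][in_channel]."""
    rng = random.Random(seed)
    out_dim = FRAMES_PER_STEP * N_MELS
    return [[[rng.uniform(-0.1, 0.1) for _ in range(LATENT_DIM)]
             for _ in range(out_dim)]
            for _ in range(KERNEL)]


def causal_conv_head(latents, weights):
    """Map T latent vectors to T blocks of FRAMES_PER_STEP mel frames.

    Padding is added on the left only, so the output at step t depends on
    latents t-KERNEL+1 .. t and never on later ones.
    """
    kernel = len(weights)
    zero = [0.0] * LATENT_DIM
    padded = [zero] * (kernel - 1) + list(latents)
    outputs = []
    for t in range(len(latents)):
        window = padded[t:t + kernel]  # current latent plus kernel-1 past ones
        flat = [sum(weights[k][o][i] * window[k][i]
                    for k in range(kernel)
                    for i in range(LATENT_DIM))
                for o in range(FRAMES_PER_STEP * N_MELS)]
        # Split the flat output vector into 4 consecutive mel frames.
        frames = [flat[f * N_MELS:(f + 1) * N_MELS]
                  for f in range(FRAMES_PER_STEP)]
        outputs.append(frames)
    return outputs
```

Changing a latent at step t leaves every output before t untouched, which is the property that makes the head usable for autoregressive generation.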
Sentiment

The one comment with sentiment is positive: it suggests the 52M-parameter decoder-only TTS model could become a strong long-duration model with more mel-spectrogram patchification and a diffusion head.

Pos 100.0% · Neg 0.0% · 1 comment with sentiment
Cluster Engagement
Posts from X
Views 6.3K · Bookmarks 25 · Likes 56 · Replies 1
kache @yacineMTB

He's cooking

2h · Views 6.3K · Likes 56 · Bookmarks 25
寻水 @fQAcAIfIY8U18DN

@ZDi____ Someone did this a few years ago; it's called Transformer-TTS.

19h · Views 51

@ZDi____ Awesome. You should patchify the mel spectrogram much more aggressively, then use a diffusion head like VAR, and you'll have a great long-duration TTS model.

18h · Views 71 · Likes 1
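
The patchification suggested in this reply can be illustrated with a minimal sketch: consecutive mel frames are grouped into fixed-size patches, so the transformer processes one patch per step and the sequence gets shorter. The function names `patchify` and `unpatchify`, the patch size, and the zero-padding of the last patch are assumptions for illustration; the reply does not specify any of them.

```python
def patchify(mel, patch_frames):
    """Group T mel frames into ceil(T / patch_frames) flattened patches.

    mel is a list of frames, each a list of n_mels floats. The last patch
    is padded with all-zero frames (an assumption; other padding schemes
    are possible).
    """
    n_mels = len(mel[0])
    pad = (-len(mel)) % patch_frames
    padded = mel + [[0.0] * n_mels for _ in range(pad)]
    return [[value
             for frame in padded[i:i + patch_frames]
             for value in frame]
            for i in range(0, len(padded), patch_frames)]


def unpatchify(patches, n_mels, n_frames):
    """Invert patchify, dropping the frames that were added as padding."""
    frames = [patch[j:j + n_mels]
              for patch in patches
              for j in range(0, len(patch), n_mels)]
    return frames[:n_frames]


# Usage: 10 frames of 3 mel bins become 3 patches of 4 frames each.
mel = [[float(t * 10 + m) for m in range(3)] for t in range(10)]
patches = patchify(mel, patch_frames=4)
restored = unpatchify(patches, n_mels=3, n_frames=10)
```

With a patch size of 4, the sequence length drops by roughly a factor of four, which is one reason larger patches are attractive for long-duration speech.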
ZD1908 @ZDi____

@fQAcAIfIY8U18DN Transformer-TTS was encoder-decoder, and I couldn't reproduce it. This is decoder-only, and the output head is a causal conv.

19h · Views 37 · Likes 1
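
To make the decoder-only distinction concrete, here is a hedged sketch of one way such a model could lay out its input: text positions first, then mel positions, all in a single sequence under one causal attention mask. An encoder-decoder model would instead process the text in a separate encoder that the decoder attends to. The function name `decoder_only_layout` and the exact layout are assumptions; the thread only says "text then mel" and "decoder only".

```python
def decoder_only_layout(n_text, n_mel_steps):
    """Return position kinds and a causal mask for one combined sequence.

    mask[i][j] is True when position i may attend to position j, which
    under causal attention means j <= i.
    """
    kinds = ["text"] * n_text + ["mel"] * n_mel_steps
    n = len(kinds)
    mask = [[j <= i for j in range(n)] for i in range(n)]
    return kinds, mask


# Usage: 3 text positions followed by 2 mel steps.
kinds, mask = decoder_only_layout(n_text=3, n_mel_steps=2)
```

In this layout the first mel position can attend to the whole text prefix, while no position can attend to anything after it.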