Tech · 2h ago

Developer Builds 52M-Parameter Decoder-Only TTS Model Predicting Mel Spectrograms

Original post

ZD1908 @ZDi____

It works! I made it a decoder-only TTS model. Text then mel with the head. This one is just 52M param, trained from scratch on LJSpeech (20 hours). The audio quality is shitty because I'm inverting the melspectrogram with Griffin-Lim.
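For readers unfamiliar with Griffin-Lim, here is a minimal NumPy sketch of the phase-recovery loop. It is not the poster's code. It assumes a linear-frequency magnitude STFT as input, so a mel spectrogram would first have to be mapped back to linear bins, and that step is omitted. The function names and the `n_fft`, `hop`, and `n_iter` values are illustrative choices.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns an array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Overlap-add inverse of stft, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, then alternate between time and frequency domains.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        est = stft(x, n_fft, hop)
        # Keep the target magnitude, adopt the phase of the re-analyzed signal.
        phase = est / np.maximum(np.abs(est), 1e-8)
    return istft(mag * phase, n_fft, hop)
```

Because the phase is only estimated from magnitudes, the result usually sounds rougher than a neural vocoder's output, which fits the quality caveat in the post.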

ZD1908 @ZDi____

I'm running an experiment. With AR transformers for speech, do we need a tokenizer, or can we get away with predicting mel spectrogram directly? This unconditional transformer predicts a latent which then goes into a 1D causal conv that predicts the next 4 mel frames.

9:52 PM · Jun 10, 2026 · 16.2K Views
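The output head described in the post can be sketched roughly as follows. This is not the poster's implementation: the transformer is left out, and the kernel size, latent width, and 80 mel bins are assumptions. Only the "causal conv that predicts the next 4 mel frames" idea comes from the post.

```python
import numpy as np

def causal_conv1d(x, weight, bias):
    """Causal 1D convolution over time.

    x: (T, d_in) latents, weight: (k, d_in, d_out), bias: (d_out,).
    Output at step t depends only on inputs at steps t-k+1 .. t.
    """
    k = weight.shape[0]
    # Left-pad with zeros so no output step can see future latents.
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.empty((x.shape[0], weight.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + k]
        out[t] = np.einsum("ki,kio->o", window, weight) + bias
    return out

def mel_head(latents, weight, bias, frames_per_step=4, n_mels=80):
    """Map each latent step to a block of `frames_per_step` mel frames."""
    out = causal_conv1d(latents, weight, bias)
    return out.reshape(latents.shape[0], frames_per_step, n_mels)
```

With latents of shape `(T, d)` and a weight of shape `(k, d, 4 * 80)`, `mel_head` returns `(T, 4, 80)`: one block of four mel frames per autoregressive step.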

Sentiment

Commenters responded positively to the developer's 52M-parameter decoder-only TTS model, which predicts mel spectrograms directly. One reply suggested improvements such as heavier patchification of the mel spectrogram and a diffusion head.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 6.3K · Bookmarks: 25 · Likes: 56 · Replies: 1

kache @yacineMTB

He's cooking

ZD1908 @ZDi____

2h · 6.3K views · 56 likes · 25 bookmarks

寻水 @fQAcAIfIY8U18DN

@ZDi____ someone did this a few years ago, it's called Transformer-TTS

19h

Ryan Tremblay @zaptrem

@ZDi____ Awesome, you should do a lot more patchification of the mel spectrogram then use a diffusion head like VAR and you'll have a great long-duration tts model.

18h
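One way to read the patchification suggestion above is to group several consecutive mel frames into a single vector per model step, so the sequence the transformer sees gets shorter. A minimal sketch under that assumption follows. The function names and the zero-padding choice are hypothetical, and the diffusion head from the reply is not shown.

```python
import numpy as np

def patchify(mel, patch_frames):
    """Group consecutive mel frames into flat patches.

    mel: (T, n_mels). Returns (ceil(T / patch_frames), patch_frames * n_mels).
    The tail is zero-padded so T need not be a multiple of patch_frames.
    """
    n_frames, n_mels = mel.shape
    pad = (-n_frames) % patch_frames
    mel = np.pad(mel, ((0, pad), (0, 0)))
    return mel.reshape(-1, patch_frames * n_mels)

def unpatchify(patches, n_mels, n_frames):
    """Invert patchify and drop the padded tail frames."""
    return patches.reshape(-1, n_mels)[:n_frames]
```

For example, a 10-frame, 80-bin mel with `patch_frames=4` becomes 3 patches of 320 values each, and `unpatchify` recovers the original 10 frames.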

ZD1908 @ZDi____

@fQAcAIfIY8U18DN Transformer-TTS was encoder-decoder and I couldn't reproduce it. This is decoder only and the output head is a causal conv.

19h