It works! I turned it into a decoder-only TTS model: the sequence is text first, then mel frames predicted by the head. This one is just 52M params, trained from scratch on LJSpeech (20 hours). The audio quality is shitty because I'm inverting the mel spectrogram with Griffin-Lim.
I'm running an experiment. With AR transformers for speech, do we need a tokenizer, or can we get away with predicting the mel spectrogram directly? This unconditional transformer predicts a latent, which then goes into a 1D causal conv that predicts the next 4 mel frames.
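
A rough sketch of the mel head idea, not the actual model code: each latent from the transformer goes through a causal 1D conv (it only looks at the current and earlier latents) and comes out as 4 mel frames. It uses plain Python lists so it runs without dependencies; a real model would use a framework conv layer. `N_MELS`, `LATENT_DIM`, `KERNEL`, and the random weights are all assumptions for illustration; only the 4 frames per step comes from the post.

```python
import random

FRAMES_PER_STEP = 4  # from the post: the head predicts the next 4 mel frames
N_MELS = 80          # assumption: a common mel bin count, not stated in the post
LATENT_DIM = 8       # assumption: tiny latent size to keep the sketch fast
KERNEL = 3           # assumption: width of the causal conv kernel


def make_weights(rng):
    """Random weights shaped [KERNEL][LATENT_DIM][FRAMES_PER_STEP * N_MELS]."""
    out_dim = FRAMES_PER_STEP * N_MELS
    return [[[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
             for _ in range(LATENT_DIM)]
            for _ in range(KERNEL)]


def causal_conv_head(latents, weights):
    """Map T latents to T * FRAMES_PER_STEP mel frames.

    Step t only sees latents[t - KERNEL + 1 .. t]; positions before the
    start of the sequence count as zeros (left padding), so no future
    latent can leak into an earlier output.
    """
    out_dim = FRAMES_PER_STEP * N_MELS
    frames = []
    for t in range(len(latents)):
        acc = [0.0] * out_dim
        for k in range(KERNEL):
            src = t - (KERNEL - 1) + k
            if src < 0:
                continue  # implicit left zero padding
            for i, x in enumerate(latents[src]):
                row = weights[k][i]
                for j in range(out_dim):
                    acc[j] += x * row[j]
        # Split the flat conv output into FRAMES_PER_STEP frames of N_MELS bins.
        for f in range(FRAMES_PER_STEP):
            frames.append(acc[f * N_MELS:(f + 1) * N_MELS])
    return frames


rng = random.Random(0)
weights = make_weights(rng)
# Stand-in for the transformer outputs: 5 random latent vectors.
latents = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(5)]
mel = causal_conv_head(latents, weights)
print(len(mel), len(mel[0]))  # 20 80
```

Five latents give 20 mel frames, so the transformer runs 4x fewer steps than there are frames. At generation time the predicted frames would presumably be fed back in as input for the next step, which this sketch leaves out.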


