What did they do? Cleaned the data, took a big open speech model (Whisper), changed the tokenizer, and fine-tuned on the data.
SOTA speech models for 89 languages. It appears there's plenty of speech data, so the simplest fine-tuning plus tokenization on public data improves everything substantially. And that's without any tricks, and without even using all the data... #conll #acl
