SuperBPE study shows tokenizer compression shapes LLM scaling laws
SuperBPE research finds that language model scaling laws depend on the tokenizer's compression rate. Higher compression lowers the compute-optimal ratio of training tokens to model parameters, yet the corresponding ratio of training bytes to parameters stays roughly constant across compression levels. The work concludes that tokenizers should be designed deliberately rather than treated as fixed components, since established LLM scaling laws prove sensitive to tokenization choices.
In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.
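To make the claimed invariance concrete, here is a toy sketch in Python. The numbers are made up for illustration (they are not the paper's fitted constants); it only shows how a fixed bytes-per-parameter budget implies a tokens-per-parameter ratio that falls as compression rises.

```python
# Toy illustration of the bytes-per-parameter invariance described above.
# BYTES_PER_PARAM is a hypothetical compute-optimal budget, not a value
# reported in the paper.

def optimal_tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """If the compute-optimal budget is fixed in *bytes* per parameter,
    the token budget shrinks as the tokenizer compresses more."""
    return bytes_per_param / bytes_per_token

BYTES_PER_PARAM = 80.0  # hypothetical compute-optimal bytes/param

for bytes_per_token in (4.0, 5.0, 6.4):  # increasing compression
    ratio = optimal_tokens_per_param(BYTES_PER_PARAM, bytes_per_token)
    print(f"{bytes_per_token:.1f} bytes/token -> {ratio:.1f} tokens/param")
```

Running this prints 20.0, 16.0, and 12.5 tokens/param: the token ratio drops as compression increases, while the underlying bytes/param stays fixed at 80.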

We present Compute Optimal Tokenization! 🔡 Common practice in LLM scaling work is to stick to one tokenizer while sweeping data and model size. But what happens when we also control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
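For concreteness, the compression rate here is bytes of raw text per token. A minimal sketch of how one might measure it, assuming a Hugging Face tokenizer; the model name and sample text below are placeholders, not the ones used in the paper:

```python
# Measure a tokenizer's compression rate (bytes/token), the quantity
# swept in this work. "gpt2" is a placeholder tokenizer choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sample = "Scaling laws depend on how many bytes each token covers."
num_bytes = len(sample.encode("utf-8"))
num_tokens = len(tokenizer.encode(sample))

print(f"compression rate: {num_bytes / num_tokens:.2f} bytes/token")
```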
Please see @TomLimi's thread & paper for all the cool findings. 🔍 Rather than being a static object, the tokenizer is something we can & should deliberately design as we scale up our models and runs!