
Nous Research releases Token Superposition Training with 2-3× speedup


Nous Research has released Token Superposition Training (TST), a modification to the standard LLM pretraining loop. For the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and scoring the next bag with a modified cross-entropy on the output side; it then switches to conventional next-token prediction for the remainder of the run. The company reports a 2-3× wall-clock speedup at matched FLOPs with no changes to model architecture, optimizer, tokenizer, or training data, and says the resulting model is identical at inference time to one produced by conventional pretraining. The announcement was posted from the company's account.
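
For intuition, here is a minimal sketch of what the bagged phase could look like in a plain PyTorch loop. Nothing below is from the TST paper or code: the bag size of 4, the uniform soft target, and all function names are illustrative assumptions.

```python
# Hypothetical sketch of the "bagged" first phase described above.
# Bag size, the uniform soft target, and all names are assumptions,
# not details taken from the Nous release.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Ordinary decoder-only LM; the architecture itself is left untouched."""
    def __init__(self, vocab=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, inputs_embeds):
        T = inputs_embeds.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=inputs_embeds.device), diagonal=1)
        return self.head(self.block(inputs_embeds, src_mask=causal))

def bag_embed(model, token_ids, bag_size):
    """Average the embeddings of each contiguous bag of tokens."""
    B, T = token_ids.shape
    assert T % bag_size == 0
    emb = model.embed(token_ids)                               # (B, T, D)
    return emb.view(B, T // bag_size, bag_size, -1).mean(2)    # (B, T/bag, D)

def training_step(model, token_ids, step, total_steps, bag_size=4):
    B, T = token_ids.shape
    if step < total_steps // 3:
        # Phase 1: read bags, predict the next bag with a soft cross-entropy
        # against a uniform distribution over that bag's tokens.
        logits = model(bag_embed(model, token_ids, bag_size))  # (B, T/bag, V)
        ids = token_ids.view(B, T // bag_size, bag_size)
        target = torch.zeros_like(logits).scatter_add_(
            2, ids, torch.full(ids.shape, 1.0 / bag_size, device=logits.device))
        logp = F.log_softmax(logits[:, :-1], dim=-1)
        return -(target[:, 1:] * logp).sum(-1).mean()
    # Phase 2: ordinary next-token prediction on the same data.
    logits = model(model.embed(token_ids))
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
```

The switch at total_steps // 3 mirrors the "first third of training" split in the announcement; the actual bag-level loss Nous uses (the "modified cross-entropy") may differ from the uniform soft target shown here.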

Original post

Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.

10:09 AM · May 13, 2026
Reposted by

@itsclivetime basically better initial weights? feels like it's basically a warmup run of one query/key epoch over the whole dataset

Clive Chan @itsclivetime

this is quite remarkable

7:57 PM · May 13, 2026 · 23.2K Views
10:46 PM · May 13, 2026 · 563 Views

this is a really cool idea and incredibly elegant! great work from the nous team truly

immediate thoughts (having not read the full paper so they probably answer these):
- again, proof that there is so much low hanging fruit left in architecture/algorithms
- if model computation is inherently sparse, this might also allow for an elegant inference win somewhere? you could persist training like this until the end (maybe you could vary bag size over training too, to make it robust to variable batch size)
- i’m super curious what loss as a function of bag size is, how much capacity are you losing for N tokens
- this suggests the model is able to internally disconnect merged sequences at each token, right? are they using any token sequence embedding? if not, that seems pretty free, might help a good amount
- this might have interesting regularization effects for memorization: ex, you won’t memorize a batch of UUIDs/random strings so long as sequences shuffle each epoch, since you couldn’t split them apart.
- i’m very curious what 99% TST training would perform like, with the minimum normal finetune at the end
- you might want variable bag size and just anneal it down over training? like if the model is learning bigram/n-gram on step 1, then the penalty for large bag size might be lower than towards the end of training? just an idea
- excited to see this on the nanogpt speedrun
- curious about learning rate/optimization dynamics here, if i have batch size N split up across K number of bags, i surely should expect more variance for K << N vs K == N (normal training), and therefore smaller steps
- this might be cool to try with ragged training, just yolo a bunch of awkwardly sized sequences in there and model just naturally adapts to varying numbers of tokens
- weird angle, what about this at RL time? really depends how much adaptation you need. i’m reminded by the RL embeddings paper (where you can just drop the one hots and just embed the mixture). i’m just thinking about maintaining parallel chains of thought for the cost of just one, could be really cool but half developed thought

anyways back to turkey hunting

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
2:57 PM · May 14, 2026 · 3.2K Views

Check out our researchers' latest paper that introduces Superposition, a potential path to multiplying training speed during pre-training.

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
6:42 PM · May 13, 2026 · 14K Views

Awesome. LLM mixup weirdness.

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
8:37 AM · May 14, 2026 · 15.3K Views

very cool

a 2-3x speedup in training by essentially letting the model learn more flexibly in its early stages than a rigid regime would allow

sort of akin to how homeschooling is much better for some kids than factory education

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
2:12 PM · May 14, 2026 · 20.9K Views

this is quite remarkable

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
7:57 PM · May 13, 2026 · 23.2K Views

someone in the comments pointed out similarity to this prior work, which i've never heard of before:

7:58 PM · May 13, 2026 · 615 Views

tl;dr:

3x fewer steps iso-data, by pre-pre-training on a new objective:
- segment of 8 tokens (mean pooled embeddings) ==> next segment of 8 tokens (multi-hot cross-entropy)
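
Taking the "multi-hot cross-entropy" in this summary literally, the target for one predicted segment might look like the toy construction below; the segment length of 8 comes from the summary, while the normalization and every name are guesses rather than details from the paper.

```python
# Toy illustration of a multi-hot target over one 8-token segment.
# The uniform normalization and soft cross-entropy are assumptions.
import torch
import torch.nn.functional as F

vocab_size = 32
segment = torch.tensor([3, 7, 7, 1, 12, 0, 5, 9])       # next segment of 8 token ids

# one slot of probability mass per token in the segment (repeats accumulate)
target = torch.zeros(vocab_size).scatter_add_(
    0, segment, torch.full((8,), 1.0 / 8))

logits = torch.randn(vocab_size)                         # model's single prediction for the segment
loss = -(target * F.log_softmax(logits, dim=-1)).sum()   # soft cross-entropy vs. the multi-hot target
print(target.nonzero().squeeze(-1), loss.item())
```

Repeated tokens simply accumulate probability mass here, which is one plausible reading of "multi-hot"; the paper may normalize differently.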

Clive Chan @itsclivetime

this is quite remarkable

7:57 PM · May 13, 2026 · 23.2K Views
8:10 PM · May 13, 2026 · 849 Views

nice pre-training work by nous claiming ~2.5x efficiency gains, building on previous research like MTP/SuperBPE. overall intuition is that at each step you want the model to process and predict more tokens

Nous Research @NousResearch
[Quoted post: the original TST announcement above · 5:09 PM · May 13, 2026 · 412.8K Views]
5:47 PM · May 13, 2026 · 25.8K Views