
Michael Y. Li, Noah Goodman, and co-authors introduce QuasiMoTTo, using correlated sampling to cut LLM test-time samples by up to 47%

It also reduces reinforcement learning training steps by 50%


Original post

Michael Y. Li@michaelyli_

You're wasting FLOPs when scaling inference compute: by independently sampling parallel attempts, you burn compute rediscovering the same solutions.

Introducing QuasiMoTTo: we scale parallel sampling with correlated samples instead! These samples have higher coverage, are marginally exact draws from the LLM, and can be generated in parallel.

Result: same performance with 25-47% fewer samples in test-time scaling + 50% fewer training steps in RL!

In our new paper, we explore the design space of correlated samplers. Work with co-authors @probablynotaz9 (co-lead), @gandhikanishk, @noahdgoodman, and Emily Fox!

10:40 AM · Jul 2, 2026 · 16.9K Views

Sentiment

Users praised QuasiMoTTo's correlated sampling technique for matching the performance of independent sampling with 25-47% fewer test-time samples, cutting the compute wasted on redundant attempts.

Positive: 100.0% · Negative: 0.0% (6 comments with sentiment)


Posts from X

Most Activity


noahdgoodman@noahdgoodman

stats nerds know: correlated samples with the same marginal distribution are magic. here we explore this magic mixed with llm magic!


Michael Y. Li@michaelyli_

This paper applies ideas from the Monte Carlo literature to scaling inference compute and RL (shout-out to Art Owen and others!). This is a rich design space that we’re excited by!

Paper: https://arxiv.org/abs/2607.01179

10/n


Michael Y. Li@michaelyli_

Test-time scaling results. QuasiMoTTo samples have higher coverage, achieving higher pass@k than i.i.d. sampling for the same budget k.

QuasiMoTTo even saturates a theoretical upper bound for any marginal-preserving sampler!

2/n


Michael Y. Li@michaelyli_

RL results. In GRPO, by swapping out the i.i.d. sampler for a correlated sampler, we see sample efficiency improvements: same reward in fewer training steps.

Where do the gains come from? The increase in coverage reduces the percentage of zero-variance groups, boosting the effective batch size.

With i.i.d. sampling, rollouts in a group frequently coincide, so the group-relative advantage collapses to zero, providing no gradient signal.

3/n
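The zero-variance collapse described above can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function names and the `eps` constant are hypothetical and not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style, assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the exact choice is an assumption
    return [(r - mean) / (std + eps) for r in rewards]

def effective_fraction(groups):
    """Fraction of groups whose rewards are not all identical (groups that carry signal)."""
    useful = sum(1 for g in groups if len(set(g)) > 1)
    return useful / len(groups)

# Every rollout coincides: all advantages are exactly zero, so no gradient signal.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: advantages are nonzero and the group contributes to the update.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Under this reading, anything that makes rollouts within a group more diverse raises `effective_fraction`, which is one way to interpret the "effective batch size" remark.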


Michael Y. Li@michaelyli_

But how do you correlate samples from an LLM?

Two tricks: (1) Arithmetic coding + inverse-CDF sampling. N uniform points on [0,1] -> N rollouts. (2) Quasi-Monte Carlo (QMC) chooses those N points to be more evenly spread than i.i.d., but each point is uniform (marginally).

This arithmetic sampling approach was introduced in Vilnis et al. 2022 (arXiv:2210.15458); for a review of arithmetic coding, see David MacKay’s wonderful book.

4/n


Michael Y. Li@michaelyli_

Trick 1: inverse-CDF sampling + arithmetic coding

How to sample one discrete variable? Split [0,1] into bins with widths = probs, sample uniformly, see which bin you land in. For sequences? Same trick but use arithmetic coding, which maps every sequence to a subinterval of [0,1] whose length = probability of sequence under LLM.

In more detail: sample a uniform u. Split [0,1] into bins with sizes equal to first-token probabilities p(x_1). Step into the bin containing u and "collect" that token. Now repeat inside that bin with p(x_2|x_1), and so on… Since a uniform u lands in a sequence's subinterval with probability = interval length = sequence probability, this procedure yields exact samples.

5/n
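The interval-narrowing procedure described above can be sketched on a toy model. Everything here is illustrative: `toy_next_token_probs` is a made-up stand-in for an LLM's next-token distribution, and plain floats are used for the interval, which a real implementation would need to treat more carefully because the interval shrinks with every token.

```python
def toy_next_token_probs(prefix):
    """Hypothetical stand-in for an LLM: returns [(token, prob), ...] given a prefix."""
    if len(prefix) == 0:
        return [("a", 0.5), ("b", 0.5)]
    return [("a", 0.25), ("b", 0.25), ("<eos>", 0.5)]

def arithmetic_sample(u, next_token_probs, max_len=8, eos="<eos>"):
    """Map one uniform u in [0, 1) to a sequence by repeatedly stepping into the bin containing u."""
    lo, width = 0.0, 1.0  # current subinterval is [lo, lo + width)
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(tuple(seq))
        cum = lo
        for i, (tok, p) in enumerate(probs):
            w = width * p  # bin width = current width * conditional probability
            is_last = i == len(probs) - 1  # guard against float round-off at the right edge
            if u < cum + w or is_last:
                lo, width = cum, w
                seq.append(tok)
                break
            cum += w
        if seq[-1] == eos:
            break
    return seq

# Each u selects the sequence whose subinterval contains it.
sample_low = arithmetic_sample(0.1, toy_next_token_probs)
sample_mid = arithmetic_sample(0.3, toy_next_token_probs)
sample_high = arithmetic_sample(0.9, toy_next_token_probs)
```

In this toy model the sequence `a <eos>` owns the subinterval [0.25, 0.5), whose length 0.25 equals its probability 0.5 × 0.5, so a uniform `u` lands there exactly a quarter of the time.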


Michael Y. Li@michaelyli_

Trick 2: high-coverage batch via randomized QMC.

We want samples spread apart, but each sample must be marginally uniform for inverse-CDF to be correct. How to do this? Toy example: sample u~Unif[0,1] and consider (u, 1−u). Each coordinate is uniform, but samples never collide.

We use randomized QMC to generate N uniform samples.

The lattice construction is intuitive. Generate points on the unit circle with fixed spacing and then rotate all points by the same randomly sampled angle. Rotation preserves relative distances, but makes each individual point marginally uniform, so pushing them through the inverse CDF yields correctly distributed samples.

6/n
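As a rough sketch of the two constructions mentioned above, the antithetic toy pair and a randomly rotated one-dimensional lattice might look like this. Treating [0, 1) as the unit circle, "rotation" becomes adding a shared random shift modulo 1. The function names are hypothetical, and the paper's actual samplers may differ.

```python
import random

def antithetic_pair(rng):
    """Toy example: (u, 1 - u). Each coordinate is uniform, but the two are mirrored."""
    u = rng.random()
    return (u, 1.0 - u)

def rotated_lattice(n, rng):
    """n evenly spaced points on [0, 1), all shifted by one shared uniform offset."""
    shift = rng.random()  # the single source of randomness
    return [((i / n) + shift) % 1.0 for i in range(n)]

rng = random.Random(0)
pair = antithetic_pair(rng)
points = rotated_lattice(8, rng)
```

Because the shift is uniform, each individual point is uniform on [0, 1), while the spacing between neighbouring points stays fixed at 1/n. Each point could then be passed through an inverse-CDF sampler independently.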


Michael Y. Li@michaelyli_

Sampling is embarrassingly parallel. Intuitively, the LLM arithmetic code defines a trie. Given its own uniform u_i, each rollout i is a separate traversal of the same trie; there is no communication between rollouts. All coordination happens upfront from the N coupled uniforms.

7/n


vivek@vivekvajipey

@michaelyli_ super cool stuff


Michael Y. Li@michaelyli_

How does the choice of QMC sampler impact the results? Intuitively, there's a freedom vs. coverage tradeoff:

i.i.d. -> total freedom, weak coverage
stratified -> points repel (can’t land in the same stratum), better coverage
lattice -> one point determines all others, max coverage

9/n
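The three samplers in this tradeoff can be compared with a small sketch on [0, 1). The generators below are standard textbook constructions, not necessarily the exact ones used in the paper, and `strata_hit` is a hypothetical coverage proxy that counts how many of the n equal-width strata receive at least one point.

```python
import random

def iid_points(n, rng):
    """Independent uniforms: total freedom, points may cluster."""
    return [rng.random() for _ in range(n)]

def stratified_points(n, rng):
    """One independent uniform per stratum [i/n, (i+1)/n): points cannot share a stratum."""
    return [(i + rng.random()) / n for i in range(n)]

def lattice_points(n, rng):
    """Evenly spaced grid with one shared random shift: one point determines the rest."""
    shift = rng.random()
    return [((i / n) + shift) % 1.0 for i in range(n)]

def strata_hit(points):
    """Number of distinct equal-width strata containing at least one point."""
    n = len(points)
    return len({min(int(p * n), n - 1) for p in points})

rng = random.Random(0)
n, trials = 16, 200
mean_iid = sum(strata_hit(iid_points(n, rng)) for _ in range(trials)) / trials
hit_stratified = strata_hit(stratified_points(n, rng))
hit_lattice = strata_hit(lattice_points(n, rng))
```

With n = 16, independent points are expected to hit only about 10 of the 16 strata on average, while the stratified and lattice samplers hit all 16. All three remain marginally uniform.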


Michael Y. Li@michaelyli_

The standard pass@k estimator assumes independent samples! We develop an analogous unbiased bootstrap estimator for QMC.

To estimate pass@k given N > k samples, we first construct a fine grid of N points, then subsample k < N points by skipping with an appropriate stride.

8/n
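For reference, the standard i.i.d. pass@k estimator is 1 - C(n-c, k) / C(n, k) for c correct samples out of n. The sketch below pairs it with one possible reading of the stride idea for a rotated lattice; `pass_at_k_strided` is a hypothetical illustration and not the paper's estimator. It assumes outcomes are ordered by lattice index and that k divides N, so that every stride-N/k subsample is itself a randomly shifted k-point lattice.

```python
from math import comb

def pass_at_k_iid(n, c, k):
    """Standard unbiased pass@k estimate from n i.i.d. samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_strided(correct, k):
    """Assumed sketch: average 'any correct' over the N/k sub-lattices of size k.

    correct: list of bools, one per lattice point, ordered by lattice index.
    """
    n = len(correct)
    if n % k != 0:
        raise ValueError("this sketch assumes k divides N")
    stride = n // k
    hits = [any(correct[offset::stride]) for offset in range(stride)]
    return sum(hits) / stride

iid_estimate = pass_at_k_iid(n=4, c=2, k=2)
outcomes = [True, False, False, False, False, False, False, False]
strided_estimate = pass_at_k_strided(outcomes, k=4)
```

In the example, only the sub-lattice starting at offset 0 contains the single correct rollout, so one of the two size-4 subsamples succeeds.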


Devin Plumb@devin_plumb

@michaelyli_ Really interesting


Michael Y. Li@michaelyli_

This project started as a nerd snipe: we joked about reviving an older paper on antithetic sampling (http://arxiv.org/abs/1810.02555). Then we were reminded of the inverse-CDF trick while playing with vLLM — and then I randomly walked past Art Owen!

Also, shout-out to my awesome co-lead on this project (a co-term!) Anthony Zhan @probablynotaz9!!

Lastly, we build on these great papers: arithmetic sampling (arXiv:2210.15458) and CARMS (arXiv:2110.14002).

11/n


Scott Linderman@scott_linderman

@michaelyli_ Super cool! Congrats @michaelyli_ and team!


Michael Y. Li@michaelyli_

@vivekvajipey Thanks Vivek!


Aman@ixchio

@michaelyli_ bruh you took 'independent samples are wasteful' and quietly repurposed 80s era compression math into a free 47% compute refund the rest of us are still finger painting


مازن وذكاء الآلات@Mazen_AIEx

@noahdgoodman Yes, this is the kind of compute-efficient scaling I care about now. Same marginals, less wasted parallel sampling, more coverage. Very elegant, and quite important for test-time scaling and RL.