Rank-2 Approximation Explains Most Variance in Model-Benchmark Matrix

alphaXiv (@askalphaxiv)

read more: https://www.alphaxiv.org/abs/2606.24020

alphaXiv (@askalphaxiv)

You Don’t Need to Run Every Eval

LLM eval suites look like dozens of independent benchmarks, but this paper shows the public score matrix is basically rank-2.

That means most model scores across 133 benchmarks can be predicted from just a few probe evals.

The paper, BenchPress, uses low-rank matrix completion to recover missing scores to within about 4.6 points, and five carefully chosen benchmarks predict the rest of a scorecard with a median absolute error (MedAE) of 3.93 points.

So evals may be less like running every test, and more like measuring the right few coordinates of model capability.
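
To make the "few coordinates" idea concrete, here is a minimal sketch, assuming a noise-free synthetic rank-2 score matrix rather than the paper's data. Benchmark factors are learned from fully observed models with a truncated SVD, and a new model's two latent coordinates are then fit by least squares from five probe scores. The probe indices and all variable names are hypothetical, and this is not the BenchPress predictor itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_benchmarks, rank = 80, 133, 2

# Synthetic, noise-free rank-2 scores (an assumption for illustration only).
bench_factors_true = rng.normal(size=(n_benchmarks, rank))
train_scores = rng.normal(size=(n_train, rank)) @ bench_factors_true.T

# Learn a basis for the benchmark factors from the fully observed models.
_, _, vt = np.linalg.svd(train_scores, full_matrices=False)
bench_factors = vt[:rank].T  # shape: (n_benchmarks, rank)

# A new model for which only five probe benchmarks have been run.
new_model = rng.normal(size=rank) @ bench_factors_true.T
probe_idx = np.array([0, 10, 20, 30, 40])  # hypothetical probe set

# Fit the model's two latent coordinates from the probe scores alone.
coords, *_ = np.linalg.lstsq(
    bench_factors[probe_idx], new_model[probe_idx], rcond=None
)

# Predict every benchmark from those coordinates and check the unseen ones.
predicted = bench_factors @ coords
held_out = np.setdiff1d(np.arange(n_benchmarks), probe_idx)
max_err = float(np.max(np.abs(predicted[held_out] - new_model[held_out])))
print(f"max error on held-out benchmarks: {max_err:.2e}")
```

On exactly rank-2 data the held-out error is at floating-point level. Real score matrices are noisy and only approximately low-rank, which is consistent with the reported errors being a few points rather than zero.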

Microsoft AI Frontiers (@ms_aifrontiers)

Collect model-by-benchmark scores into one matrix and look at its structure. On a fully audited 84-model × 133-benchmark matrix, it's effectively rank-2. That low rank is what makes the missing scores recoverable.
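
One way to "look at its structure" is to inspect the singular values of the matrix. The sketch below assumes a synthetic 84 × 133 matrix built from two latent factors plus small noise, since the audited data is not reproduced here. It reports the share of squared singular values (energy, without mean-centering) captured by the leading components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the score matrix: 84 models x 133 benchmarks,
# generated from 2 latent factors plus small noise (assumed, not real data).
n_models, n_benchmarks, true_rank = 84, 133, 2
model_factors = rng.normal(size=(n_models, true_rank))
bench_factors = rng.normal(size=(n_benchmarks, true_rank))
noise = 0.1 * rng.normal(size=(n_models, n_benchmarks))
scores = model_factors @ bench_factors.T + noise

# Squared singular values give the energy carried by each rank-1 component.
singular_values = np.linalg.svd(scores, compute_uv=False)
energy = singular_values**2
explained = np.cumsum(energy) / energy.sum()

print(f"share explained by rank 1: {explained[0]:.3f}")
print(f"share explained by rank 2: {explained[1]:.3f}")
```

If the first two components carry nearly all of the energy and the remaining singular values form a flat tail, the matrix is effectively rank-2 in the sense used above.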

Microsoft AI Frontiers (@ms_aifrontiers)

Most LLM benchmark scores are predictable before you ever run them. New from the MS AI Frontiers team: BenchPress. The 84-model × 133-benchmark score matrix turns out to be effectively rank-2, so matrix completion fills in the rest. 5 probes recover a model's whole profile. Paper, code, demo 👇

Microsoft AI Frontiers (@ms_aifrontiers)

It also picks good seed sets. {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers a model's public profile to ~3.93 MedAE. A cheaper set lands at 4.55. Both beat random. Plus ranking, forecasting new releases, and calibrated confidence intervals.

Microsoft AI Frontiers (@ms_aifrontiers)

Running every benchmark on every checkpoint is slow and expensive. New work from the MS AI Frontiers team asks: do you even need to? BenchPress predicts benchmark scores without running them. 👇

Yifei Wang (@WangYw251)

Interesting work!

Yuchen Zeng (@yzeng58)

💻 Tired of running so many slow, expensive benchmark evals across every checkpoint?

Try ✨BenchPress✨ at https://microsoft.github.io/benchpress/: provide a few benchmark scores, then get predictions for the remaining ~100 benchmarks, with trust probabilities and calibrated 90% prediction intervals.

How does this work? In his original post (https://x.com/DimitrisPapail/status/2026531440414925307), @DimitrisPapail first tried the idea as a fun question: collect model-by-benchmark scores into a matrix, find its low-rank structure, and use matrix completion to predict missing benchmark scores from a few observed ones.

We expanded this into a full system: a fully audited 84-model × 133-benchmark score matrix, an optimized matrix-completion predictor, and a reliability layer for trust probabilities and 90% prediction intervals.

Beyond predicting missing scores, we also suggest practical seed benchmark sets. The five-probe set {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers the rest of a model's public score profile with a MedAE of 3.93 points. A lower-cost set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} reaches 4.55 points.

See more details below 🧵1/7

This work is with @DimitrisPapail at AI Frontiers, a boutique research lab inside @MSFTResearch.
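
The MedAE figures quoted in this thread are median absolute errors in score points. A minimal sketch of the metric follows; the five score pairs are made-up illustrative numbers, not results from the paper, and the function name is an assumption.

```python
import numpy as np

def median_absolute_error(actual, predicted):
    """Median of |actual - predicted|, in the same units as the scores."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.median(np.abs(actual - predicted)))

# Hypothetical held-out benchmark scores and predictions for one model.
actual = [62.0, 18.5, 71.0, 80.2, 45.0]
predicted = [58.0, 21.0, 70.0, 86.2, 44.0]

medae = median_absolute_error(actual, predicted)
print(f"MedAE: {medae:.2f} points")
```

Because it is a median, MedAE says that half of the predictions land within that many points. It does not bound the worst case, which is one reason per-prediction intervals are useful alongside it.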

Microsoft AI Frontiers (@ms_aifrontiers)

BenchPress uses matrix completion (logit transform + bias-decomposed ALS) to predict the unobserved scores. Reveal a few of a model's scores and error on the rest drops fast, often under 2 points.
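
The post names the ingredients (a logit transform and bias-decomposed ALS) but not the details, so the following is only a sketch of one plausible reading, not the BenchPress implementation. It assumes scores on a 0 to 100 scale, models the logit of each score as a global mean plus a model bias, a benchmark bias, and a rank-2 interaction, and fits everything by ridge-regularised alternating least squares on observed entries. The function name, hyperparameters, and synthetic data are all assumptions.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def complete_scores(scores, observed, rank=2, ridge=0.1, n_iters=50, seed=0):
    """Fill in a 0-100 score matrix; `observed` is a boolean mask."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = scores.shape
    p = np.clip(scores / 100.0, 1e-3, 1.0 - 1e-3)
    z = np.where(observed, logit(p), 0.0)  # unobserved cells are masked out
    mu = z[observed].mean()
    row_bias, col_bias = np.zeros(n_rows), np.zeros(n_cols)
    U = 0.1 * rng.normal(size=(n_rows, rank))
    V = 0.1 * rng.normal(size=(n_cols, rank))
    reg = ridge * np.eye(rank)
    for _ in range(n_iters):
        # Bias updates: shrunken means of the current residuals.
        resid = z - mu - col_bias[None, :] - U @ V.T
        row_bias = (resid * observed).sum(axis=1) / (observed.sum(axis=1) + ridge)
        resid = z - mu - row_bias[:, None] - U @ V.T
        col_bias = (resid * observed).sum(axis=0) / (observed.sum(axis=0) + ridge)
        # Factor updates: one small ridge regression per row, then per column.
        target = z - mu - row_bias[:, None] - col_bias[None, :]
        for i in range(n_rows):
            Vi = V[observed[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ target[i, observed[i]])
        for j in range(n_cols):
            Uj = U[observed[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ target[observed[:, j], j])
    z_hat = mu + row_bias[:, None] + col_bias[None, :] + U @ V.T
    return 100.0 * sigmoid(z_hat)

# Synthetic demo: 84 x 133 scores with 60% of entries observed (assumed data).
rng = np.random.default_rng(0)
n_models, n_benchmarks = 84, 133
z_true = (
    rng.normal(scale=0.5, size=(n_models, 1))
    + rng.normal(scale=0.5, size=(1, n_benchmarks))
    + rng.normal(scale=0.7, size=(n_models, 2))
    @ rng.normal(scale=0.7, size=(2, n_benchmarks))
)
true_scores = 100.0 * sigmoid(z_true)
observed = rng.random((n_models, n_benchmarks)) < 0.6

predicted = complete_scores(true_scores, observed)
hidden = ~observed
medae = float(np.median(np.abs(predicted[hidden] - true_scores[hidden])))

# Baseline for comparison: predict each benchmark's observed mean score.
col_means = (true_scores * observed).sum(axis=0) / observed.sum(axis=0)
baseline = np.broadcast_to(col_means, true_scores.shape)
baseline_medae = float(np.median(np.abs(baseline[hidden] - true_scores[hidden])))
print(f"ALS MedAE: {medae:.2f}  per-benchmark-mean MedAE: {baseline_medae:.2f}")
```

Working in logit space keeps predictions inside the 0 to 100 range after the inverse transform, and the separate bias terms absorb overall model strength and benchmark difficulty so that the low-rank factors only need to explain what is left.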

Maxence Frenette (@maxencefrenette)

@rosinality Seems like just a replication of https://arxiv.org/abs/2512.00193

Microsoft AI Frontiers (@ms_aifrontiers)

Excellent work by @yzeng58 and @DimitrisPapail
📄 http://arxiv.org/abs/2606.24020
💻 http://github.com/microsoft/benchpress
🤗 http://huggingface.co/datasets/microsoft/benchpress-score-matrix
🔗 http://microsoft.github.io/benchpress

Yuchen Zeng (@yzeng58)

@WangYw251 Thanks so much!

Alexa | Startup founder (@alexabelonix)

@askalphaxiv very well done.

Adel Bucetta (@adelbucetta)

@askalphaxiv that's exactly what we did at http://kaigen.ai when building our AI agents for autonomous investing: we collapsed the eval matrix into key probes, saved countless hours, and still got market-beating results

haber twit (@twittenhaber)

@askalphaxiv curious what the rank-2 dimensions actually are — instruction-following and raw knowledge would be way too neat, bet there's something weirder hiding in there
