💻Tired of running so many slow, expensive benchmark evals across every checkpoint?
Try ✨BenchPress✨ at https://microsoft.github.io/benchpress/: provide a few benchmark scores, then get predictions for the remaining ~100 benchmarks, with trust probabilities and calibrated 90% prediction intervals.
How does this work? In his original post (https://x.com/DimitrisPapail/status/2026531440414925307), @DimitrisPapail posed the idea as a fun question: collect model-by-benchmark scores into a matrix, find its low-rank structure, and use matrix completion to predict a model's missing benchmark scores from a few observed ones.
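To make the idea concrete, here is a minimal sketch of low-rank matrix completion by iterative SVD imputation. It is an illustration only, not the BenchPress predictor: the function name `complete_low_rank`, the rank, the iteration count, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def complete_low_rank(scores, rank=2, n_iters=200):
    """Fill NaN entries of `scores` by repeated rank-`rank` SVD reconstruction."""
    observed = ~np.isnan(scores)
    # Start by filling each missing entry with its benchmark (column) mean.
    col_means = np.nanmean(scores, axis=0)
    filled = np.where(observed, scores, col_means)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Observed scores stay fixed; only the missing entries are updated.
        filled = np.where(observed, scores, low_rank)
    return filled

# Synthetic stand-in for a model-by-benchmark matrix (NOT real benchmark data).
rng = np.random.default_rng(0)
true_scores = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 12))  # exactly rank 2
hidden = rng.random(true_scores.shape) < 0.2                       # hide ~20% of entries
masked = true_scores.copy()
masked[hidden] = np.nan

completed = complete_low_rank(masked, rank=2)
medae = np.median(np.abs(completed[hidden] - true_scores[hidden]))
print(f"MedAE on hidden entries: {medae:.3f}")
```

Real score matrices are only approximately low-rank and much sparser, so a production predictor would need regularization and careful rank selection on top of this.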
We expanded this into a full system: a fully audited 84-model x 133-benchmark score matrix, an optimized matrix-completion predictor, and a reliability layer for trust probabilities and 90% prediction intervals.
Beyond predicting missing scores, we also suggest practical seed benchmark sets. The five-probe set {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers the rest of a model's public score profile with a median absolute error (MedAE) of 3.93 points. A lower-cost set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} reaches a MedAE of 4.55 points.
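For readers unfamiliar with the metric, MedAE is the median of the absolute differences between predicted and actual scores. A tiny sketch follows; the numbers are made up for illustration and are not BenchPress results.

```python
import numpy as np

# Hypothetical predicted vs. actual scores on five held-out benchmarks.
predicted = np.array([71.0, 48.5, 90.0, 33.0, 62.0])
actual = np.array([68.0, 50.0, 95.0, 32.0, 70.0])

abs_errors = np.abs(predicted - actual)  # [3.0, 1.5, 5.0, 1.0, 8.0]
medae = float(np.median(abs_errors))     # median absolute error: 3.0
mae = float(abs_errors.mean())           # mean absolute error: 3.7
print(medae, mae)
```

Because it is a median, MedAE is less sensitive than the mean to a single badly predicted benchmark (the 8-point miss here).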
See more details below 🧵1/7
This is joint work with @DimitrisPapail at AI Frontiers, a boutique research lab inside @MSFTResearch.