
Leshem Choshen, IBM Research and MIT CSAIL postdoc, co-launches Evaluation Cards to track reproducible AI model evaluation data

The platform compares discrepant reported scores for the same model.

Original post
EvalEval Coalition@evaluatingevals

🚀We launch Evaluation Cards (beta): a centralized public record of AI evaluation results 🚀

Not another leaderboard. Every score comes with who ran it, the settings they used, what the benchmark tests and the other results reported for the same model, side by side. 🧵👇

10:22 AM · Jun 11, 2026 · 837 Views

@evaluatingevals This is the work of so many people, but it doesn't mean you can't improve it. Any ideas? Feedback? Contributions (data)?

https://evalcards.evalevalai.com/

EvalEval Coalition@evaluatingevals

Pick a model.

You get every published benchmark result for it: the developer's own numbers & independent ones, shown separately. The temperature, max tokens, & harness behind each run. What each benchmark measures & its caveats. And a flag wherever reported scores disagree.
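One way to picture such a per-result record is the minimal Python sketch below. It is an illustration only, not the Evaluation Cards schema: the class name `EvalResult`, its field names, the two helper functions, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvalResult:
    """One reported score plus the run settings named in the post (hypothetical schema)."""
    model: str
    benchmark: str
    score: float
    reported_by: str
    is_developer: bool                    # True when the model's developer reported it
    temperature: Optional[float] = None   # None means the setting was not reported
    max_tokens: Optional[int] = None
    harness: Optional[str] = None


def split_by_source(results: List[EvalResult]) -> Tuple[List[EvalResult], List[EvalResult]]:
    """Return (developer results, independent results) so they can be shown separately."""
    developer = [r for r in results if r.is_developer]
    independent = [r for r in results if not r.is_developer]
    return developer, independent


def missing_settings(result: EvalResult) -> List[str]:
    """List which of the three run settings were not reported for this result."""
    names = ("temperature", "max_tokens", "harness")
    return [name for name in names if getattr(result, name) is None]


# Placeholder data, invented for illustration only.
results = [
    EvalResult("example-model", "example-bench", 90.0, "developer-org", True,
               temperature=0.0, max_tokens=2048, harness="example-harness"),
    EvalResult("example-model", "example-bench", 85.0, "independent-lab", False,
               temperature=0.7),
]
developer, independent = split_by_source(results)
```

Here `missing_settings(independent[0])` would list `max_tokens` and `harness`, the kind of gap a reader needs to see before comparing two runs.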

EvalEval Coalition@evaluatingevals

We launch a beta version because we want to build it together with you. Tell us what's broken, what's missing, and what would help your work:

🔗 https://evalcards.evalevalai.com 📄 https://arxiv.org/abs/2606.09809 📜 https://rb.gy/o6v9oh 🗺️ https://changemap.co/evaleval/evalcards/

EvalEval Coalition@evaluatingevals

Why it matters:

The same model on the same benchmark often gets very different scores. Three organizations report GPT-5’s score on MATH-500, with values ranging from 84.7% to 98.9%. A leaderboard shows you one of those numbers. We show you all of them, and what was different about each run.
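A disagreement flag of this kind could be computed roughly as in the sketch below. This is an assumption about the approach, not the platform's actual logic: the function name `flag_disagreements` and the 1-point `tolerance` are hypothetical. Only the two extremes quoted above are used, since the third organization's score is not given here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float]  # (model, benchmark, score in percent)


def flag_disagreements(rows: Iterable[Row], tolerance: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Map each (model, benchmark) pair to its score spread when the spread exceeds tolerance."""
    grouped = defaultdict(list)
    for model, benchmark, score in rows:
        grouped[(model, benchmark)].append(score)

    flags = {}
    for key, scores in grouped.items():
        # Round to avoid floating-point noise in the subtraction.
        spread = round(max(scores) - min(scores), 2)
        if len(scores) > 1 and spread > tolerance:
            flags[key] = spread
    return flags


# The lowest and highest GPT-5 scores on MATH-500 quoted in the thread.
rows = [
    ("GPT-5", "MATH-500", 84.7),
    ("GPT-5", "MATH-500", 98.9),
]
flags = flag_disagreements(rows)
print(flags)  # {('GPT-5', 'MATH-500'): 14.2}
```

A spread alone says nothing about why the scores differ, so a real flag would need to sit next to the run settings for each report.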

EvalEval Coalition@evaluatingevals

Built in the open by the EvalEval Coalition, led by @evijit, @AnkaReuel, Jenny Chim, and Matt Kennedy, with contributors from dozens of institutions. It builds on previous efforts such as Every Eval Ever (also by the EvalEval Coalition!) and Auto-BenchmarkCards.

EvalEval Coalition@evaluatingevals

Build evals? People can find your benchmark, see what it measures, and see a live leaderboard of results. No need for you to build separate infrastructure. Plus you get extra exposure. Study evals? You can access 100k+ results, each with a source, versioned by snapshot.

EvalEval Coalition@evaluatingevals

Massive thanks to all our co-authors and in particular top contributors @_srishtiyadav, @YananLong, @Andrxwtran, Jennifer Mickel, and others from 32 institutions and organizations, including 👇

EvalEval Coalition@evaluatingevals

What's in it for you: Build models? Earn user trust: your results sit in public next to everyone else's, independent evaluators can corroborate them, and submitting through your org's HF account marks them as verified. You can also contest other developers’ results.

EvalEval Coalition@evaluatingevals

Every result carries 4 signals, so you know how much to trust it:

- Reproducibility (can you rerun it?)
- Completeness (is there enough context to interpret it?)
- Provenance (who reported it, who confirmed it?)
- Comparability (are diverging scores measuring the same thing?)

EvalEval Coalition@evaluatingevals

We also provide two reader views: a summary view if you just want to know whether you can trust a result, and a researcher view with the technical details: temperature, max tokens, harness, sources, and confidence intervals. Same data, adjusted to different information needs.

EvalEval Coalition@evaluatingevals

@huggingface, @Stanford, @StanfordAILab, @SISLaboratory, @HooverInst, @QMUL, @TrustibleAI, @AiEleuther, @IowaState, @TUDarmstadt, @JWI_Berlin, @Harvard, @IBMResearch, @aigioxford, @ETH_en, @ETH_AI_Center,

EvalEval Coalition@evaluatingevals

@Oxfordinternet, @AmherstCollege, @UNLincoln, @mcgillu, @GeorgiaTech, @Mila_Quebec, @NotreDame, @Georgetown, @MIT
