Tech · 4h ago

Leshem Choshen, IBM Research and MIT CSAIL postdoc, co-launches Evaluation Cards to track reproducible AI model evaluation data

The platform compares discrepant reported scores for the same model.

Original post

EvalEval Coalition@evaluatingevals

🚀We launch Evaluation Cards (beta): a centralized public record of AI evaluation results 🚀

Not another leaderboard. Every score comes with who ran it, the settings they used, what the benchmark tests and the other results reported for the same model, side by side. 🧵👇

10:22 AM · Jun 11, 2026 · 837 Views

Posts from X

Leshem (Legend) Choshen 🤖🤗@LChoshen

@evaluatingevals This is the work of so many people, but it doesn't mean you can't improve it. Any ideas? Feedback? Contributions (data)?

https://evalcards.evalevalai.com/

4h75

LIKES3

EvalEval Coalition@evaluatingevals

Pick a model.

You get every published benchmark result for it: the developer's own numbers & independent ones, shown separately. The temperature, max tokens, & harness behind each run. What each benchmark measures & its caveats. And a flag wherever reported scores disagree.

4h373

REPLIES1

EvalEval Coalition@evaluatingevals

We launch a beta version because we want to build it together with you. Tell us what's broken, what's missing, and what would help your work:

🔗https://evalcards.evalevalai.com 📄 https://arxiv.org/abs/2606.09809 📜https://rb.gy/o6v9oh 🗺️https://changemap.co/evaleval/evalcards/

4h171

EvalEval Coalition@evaluatingevals

Why it matters:

The same model on the same benchmark often gets very different scores. Three organizations report GPT-5 scores on MATH-500 that range from 84.7% to 98.9%. A leaderboard shows you one of those numbers. We show you all of them, and what was different about each run.
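To make the disagreement flag concrete, here is a minimal Python sketch of how reported scores for the same model and benchmark could be grouped and flagged when they diverge. It is an illustration only, not the platform's actual logic: the `Result` record, the `flag_disagreements` helper, the reporter labels, and the 2-point `max_spread` threshold are all assumptions. Only the two endpoint scores quoted in the post are used; the third organization's score is not given, so it is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    reporter: str  # placeholder label, not a real organization
    score: float   # percent


def flag_disagreements(results, max_spread=2.0):
    """Return {(model, benchmark): (low, high)} where high - low > max_spread.

    max_spread is a hypothetical threshold in percentage points.
    """
    grouped = defaultdict(list)
    for r in results:
        grouped[(r.model, r.benchmark)].append(r.score)

    flagged = {}
    for key, scores in grouped.items():
        low, high = min(scores), max(scores)
        # A single report cannot disagree with itself.
        if len(scores) > 1 and high - low > max_spread:
            flagged[key] = (low, high)
    return flagged


# Only the two endpoints of the range quoted in the post.
results = [
    Result("GPT-5", "MATH-500", "reporter_low", 84.7),
    Result("GPT-5", "MATH-500", "reporter_high", 98.9),
]

flags = flag_disagreements(results)
for (model, benchmark), (low, high) in flags.items():
    print(f"{model} on {benchmark}: {low}-{high}% (spread {high - low:.1f} points)")
```

A real comparison would presumably also look at the settings behind each run (temperature, max tokens, harness) before treating two scores as comparable; this sketch only looks at the numbers.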

4h252

EvalEval Coalition@evaluatingevals

Built in the open by the EvalEval Coalition, led by @evijit, @AnkaReuel, Jenny Chim, Matt Kennedy, with contributors from dozens of institutions, and built on top of previous efforts such as Every Eval Ever (also by the EvalEval coalition!) and Auto-BenchmarkCards.

4h261

EvalEval Coalition@evaluatingevals

Build evals? People can find your benchmark, see what it measures, and see a live leaderboard of results. No need for you to build separate infrastructure. Plus you get extra exposure. Study evals? You can access 100k+ results, each with a source, versioned by snapshot.

4h171

EvalEval Coalition@evaluatingevals

Massive thanks to all our co-authors and in particular top contributors @_srishtiyadav, @YananLong, @Andrxwtran, Jennifer Mickel, and others from 32 institutions and organizations, including 👇

4h141

EvalEval Coalition@evaluatingevals

What's in it for you: Build models? Earn user trust: your results sit in public next to everyone else's, independent evaluators can corroborate them, and submitting through your org's HF account marks them as verified. You can also contest other developers’ results.

4h19

EvalEval Coalition@evaluatingevals

Every result carries 4 signals, so you know how much to trust it:

- Reproducibility (can you rerun it?) - Completeness (is there enough context to interpret it?) - Provenance (who reported it, who confirmed it?) - Comparability (are diverging scores measuring the same thing?)
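As a rough illustration of the four signals, the sketch below models them as a small Python record attached to a result. The thread does not say how the platform represents or scores these signals, so the `TrustSignals` class, its yes/no fields, and the `missing` helper are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class TrustSignals:
    # Boolean fields are an assumption; the real signals may be graded.
    reproducibility: bool  # can the run be repeated?
    completeness: bool     # is there enough context to interpret it?
    provenance: bool       # is it known who reported and confirmed it?
    comparability: bool    # do diverging scores measure the same thing?

    def missing(self):
        """Names of the signals that are not satisfied, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up example values, purely for illustration.
signals = TrustSignals(
    reproducibility=True,
    completeness=False,
    provenance=True,
    comparability=False,
)
print(signals.missing())
```

Listing the unmet signals, rather than collapsing them into one number, mirrors the thread's framing that each signal answers a different question about a result.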

4h17

EvalEval Coalition@evaluatingevals

We also provide two reader views: A summary view if you just want to understand if you can trust a result, and a researcher view with the technical details: temperature, max tokens, harness, sources, and confidence intervals. Same data, adjusted to varying information needs.

4h17

EvalEval Coalition@evaluatingevals

@huggingface, @Stanford, @StanfordAILab, @SISLaboratory, @HooverInst, @QMUL, @TrustibleAI, @AiEleuther, @IowaState, @TUDarmstadt, @JWI_Berlin, @Harvard, @IBMResearch, @aigioxford, @ETH_en, @ETH_AI_Center,

4h6

EvalEval Coalition@evaluatingevals

@Oxfordinternet, @AmherstCollege, @UNLincoln, @mcgillu, @GeorgiaTech, @Mila_Quebec, @NotreDame, @Georgetown, @MIT

4h17