EvalConsensusAI To Publish Full Results And Replication Dataset Tomorrow

Posts from X

Evals-Consensus.ai@EvalConsensusAI

Tomorrow, we'll be publishing the full results, a replication dataset, and a log of every place reality wandered off from our beautiful preregistered protocol. (Confession is good for the soul, and for reviewer 2.)

4h393

Evals-Consensus.ai@EvalConsensusAI

And once again, this is not a best-practice manual and not a rulebook. It's a floor. There are often good reasons to skip certain practices, but the floor makes skipping a choice you need to explain, instead of a thing you quietly hope nobody mentions.

4h18

Evals-Consensus.ai@EvalConsensusAI

Of 32 candidate practices, 27 reached consensus as strongly recommended or required across groups. (As did 73 of 109 sub-items.)

For a field that can't agree on what "cyber capability" or "bio risk" even means, getting four stakeholder groups to agree on 27 things is great news.

4h21

Evals-Consensus.ai@EvalConsensusAI

The 27 are just the spine of a competent eval: define the construct, state the threat model, say what your metric actually means, quantify uncertainty, write down the rubric, log the runs, document the failures, ship a replication package or admit why you can't.

4h18

Evals-Consensus.ai@EvalConsensusAI

Piloting and readiness checks are a good illustration. Labs and industry liked piloting, but not externally verifiable readiness checks. AISIs and regulators wanted both. Auditors disliked both.

Same practice, different incentive structures, different answers.

4h14

Evals-Consensus.ai@EvalConsensusAI

The setup: a two-round Delphi across four groups (academics + civil society, AI labs + industry, auditors + third-party evaluators, AISIs + regulators). Preregistered protocol (36 pages of it, to keep us honest) with explicit consensus criteria and clear logs.

4h14

Evals-Consensus.ai@EvalConsensusAI

We had 121 participants spanning the four categories, including people from dozens of universities, DeepMind, OpenAI, IBM, Intel, and Hugging Face - none of them necessarily representing the views of their employers. (And Anthropic folks politely declined due to a lack of time.)

4h13

Evals-Consensus.ai@EvalConsensusAI

As the results show, there are two thresholds, because we are nothing if not methodologically pedantic. Broad consensus = all groups at 50%+ positive. Strong consensus = 2+ groups above 70% positive. (Where "positive" = rated strongly recommended or required.)

4h12
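
The two thresholds in the post above can be sketched in a few lines of Python. This is only an illustration, not the study's actual analysis code. The function name, the group keys, and the example shares are all hypothetical. It also assumes that "50%+" means at or above 50%, that "above 70%" is a strict inequality, and that the two labels are evaluated independently of each other, none of which the thread confirms.

```python
from typing import Mapping

BROAD_MIN = 0.50    # assumed: every group must be at or above this positive share
STRONG_MIN = 0.70   # assumed: a group must be strictly above this to count
STRONG_GROUPS = 2   # number of groups that must clear STRONG_MIN


def classify_consensus(positive_share: Mapping[str, float]) -> dict[str, bool]:
    """Apply both thresholds to per-group shares of 'positive' ratings.

    positive_share maps a group name to the fraction of that group rating
    the practice as strongly recommended or required.
    """
    shares = list(positive_share.values())
    broad = all(share >= BROAD_MIN for share in shares)
    strong = sum(share > STRONG_MIN for share in shares) >= STRONG_GROUPS
    return {"broad": broad, "strong": strong}


# Hypothetical shares for one practice; not data from the study.
example = {
    "academics_civil_society": 0.82,
    "labs_industry": 0.55,
    "auditors_evaluators": 0.74,
    "aisis_regulators": 0.66,
}
result = classify_consensus(example)
print(result)
```

With these made-up numbers every group is at or above 50% and two groups are above 70%, so both labels come out true. Dropping any one group below 50% would remove the broad label while leaving the strong one untouched under the independence assumption.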

Evals-Consensus.ai@EvalConsensusAI

This isn't enough to make an evaluation excellent, but it is all the stuff you'd be embarrassed to admit skipping, at least without a good explanation.

4h12

Evals-Consensus.ai@EvalConsensusAI

For the 3 contested practices, the groups split: 1) prospective analysis with externally verifiable pre-run checks, 2) specific data splits and holdouts, and 3) piloting plus baseline calibration. These were valuable for some groups, but too expensive / burdensome for others.

4h9

Evals-Consensus.ai@EvalConsensusAI

The 5 that didn't make it are the interesting ones. 1 is trending toward inclusion (preregistering development and design plans). 3 are contested. 1 got no consensus whatsoever (a pre-use task summary with reporting commitments; evidently it sounds good but is too burdensome).

4h9