
Tomorrow, we'll be publishing the full results, a replication dataset, and a log of every place reality wandered off from our beautiful preregistered protocol. (Confession is good for the soul, and for reviewer 2.)

And once again, this is not a best-practice manual and not a rulebook. It's a floor. There are often good reasons to skip certain practices, but the floor turns skipping into a choice you need to explain, instead of a thing you quietly hope nobody mentions.

Of 32 candidate practices, 27 reached consensus as strongly recommended or required across groups. (As did 73 of 109 sub-items.)
For a field that can't agree on what "cyber capability" or "bio risk" even means, getting four stakeholder groups to agree on 27 things is great news.

The 27 are just the spine of a competent eval: define the construct, state the threat model, say what your metric actually means, quantify uncertainty, write down the rubric, log the runs, document the failures, ship a replication package or admit why you can't.

Piloting and readiness checks are a good illustration. Labs and industry liked piloting, but not externally verifiable readiness checks. AISIs and regulators wanted both. Auditors disliked both.
Same practice, different incentive structures, different answers.

The setup: a two-round Delphi across four groups (academics + civil society, AI labs + industry, auditors + third-party evaluators, AISIs + regulators). Preregistered protocol (36 pages of it, to keep us honest) with explicit consensus criteria and clear logs.

We had 121 participants across the four categories, including people from dozens of universities, DeepMind, OpenAI, IBM, Intel, and Hugging Face - none of them necessarily representing the views of their employers. (And Anthropic folks politely declined due to a lack of time.)

As the results show, there are two thresholds, because we are nothing if not methodologically pedantic. Broad consensus = all groups at 50%+ positive. Strong consensus = 2+ groups above 70% positive. (Where "positive" = rated strongly recommended or required.)
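
For the pedants in the audience, here is a minimal sketch of how those two thresholds could be checked for a single practice. It is not the code from the protocol: the function name, the group keys, and the example shares are all made up for illustration, and it assumes the two criteria are evaluated independently and that "50%+" is inclusive while "above 70%" is strict.

```python
from typing import Dict

# Assumed reading of the thresholds described above (not taken from the protocol text).
BROAD_MIN = 0.50      # every group must be at or above this share of positive ratings
STRONG_MIN = 0.70     # a group counts toward strong consensus only above this share
STRONG_GROUPS = 2     # number of groups that must clear STRONG_MIN


def consensus_flags(positive_share: Dict[str, float]) -> Dict[str, bool]:
    """Return broad/strong consensus flags for one practice.

    positive_share maps a stakeholder group to the fraction of its raters who
    marked the practice "strongly recommended" or "required".
    """
    shares = list(positive_share.values())
    broad = all(s >= BROAD_MIN for s in shares)
    strong = sum(s > STRONG_MIN for s in shares) >= STRONG_GROUPS
    return {"broad": broad, "strong": strong}


# Hypothetical shares for one practice; these numbers are invented, not study results.
example = {
    "academics_civil_society": 0.82,
    "labs_industry": 0.55,
    "auditors_evaluators": 0.74,
    "aisis_regulators": 0.61,
}
print(consensus_flags(example))
```

Keeping the two flags separate makes it easy to spot a practice that two groups love while a third sits below 50%, which is roughly what a contested practice looks like.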

This isn't enough to make an evaluation excellent, but it is all the stuff you'd be embarrassed to admit skipping, at least without a good explanation.

For the 3 contested practices, the groups split: 1) prospective analysis with externally verifiable pre-run checks, 2) specific data splits and holdouts, and 3) piloting plus baseline calibration. These were valuable for some groups, but too expensive / burdensome for others.

The 5 that didn't make it are the interesting ones. 1 is trending toward inclusion (preregistering development and design plans). 3 are contested. 1 got no consensus whatsoever (a pre-use task summary with reporting commitments; evidently it sounds good but is too burdensome).