METR publishes its first Frontier Risk Report assessing whether AI labs could lose control of advanced autonomous agents after testing internal models from Anthropic, Google, Meta, and OpenAI
Ajeya Cotra led the project after joining METR on January 12.
Too ubiquitous to METR
On Jan 12, I joined METR to lead writing for our first Frontier Risk Report. The last 18 weeks have been a series of wild sprints to pitch labs, negotiate contracts, analyze questionnaires, negotiate redactions, and write this thing! I'll be on TBPN at 12:30 to discuss it!
Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.
I am very grateful to METR for the huge effort of pulling together this report!
This work is the culmination of years of effort on AI evaluation science and third-party risk assessment and disclosure mechanism design. It feels like a big milestone for METR.
We designed this new procedure with an eye toward “showing by doing” how we think evaluations for the AI loss-of-control threat model should work: laying out a process that can be done periodically, not just immediately pre-deployment, and holistically assessing risk inside of an AI lab, rather than just an individual AI system.
The exercise also involved significantly deeper access than we've previously had, including raw chains-of-thought from the developers' best models and info about private model training & control protocols.
The report is long, with a bunch of new evaluation results and documentation of our process. Please check it out, or at least read the executive summary!
Progress!
But, uh, man, 12 weeks is... fast 😅

I'm excited about increasing transparency of frontier labs when it comes to loss of control risks, especially as we enter the early stages of RSI. METR does a great job coordinating this.
Incredible to see such thorough work done and reported in public; kudos to everyone involved, and to everyone working on making the field more robust based on this
Great step towards better risk assessment and external testing!
NOW: I’m on @MTSlive to talk about this!
Overall quite excited about this report!
But I wish it had quantitative risk estimates; using vague terminology rather than probabilities could lead to incorrect impressions of the report's implications, especially if the difference between 0.01% and 1% risk might matter a ton.
I think AIs are currently low enough risk that this isn't a huge deal for this report in particular, but it would be great to establish better norms for future risk assessments.

That's enough important and fascinating AI risk releases for today please. No more. Need to sleep at some point tonight.
once again, real CoTs are way more charming and diagnostic than the fake CoTs we get
I worked on the appendices for this report! They’re long and contain lots of wild stories of model behaviour - some of my favourites in this thread. (🧵)
our frontier risk report contains the most serious public assessment of AI capabilities pertinent to AI R&D acceleration to date.
it also makes clear how far the evidence base is from what might be achievable in future.
we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, investigate performance on somewhat more open-ended tasks, and much more besides.

we have so far to go, both in terms of evidence on the level of AI capabilities today and what we might expect from AI systems in 3-12 months' time.
the capabilities evidence feeds into our risk assessment. in the end, the gap between observed capabilities we are very confident AI systems have and those we are very confident they do not have is extremely wide.
it’s going to be a remarkable year for METR.
i could go on. a common theme is that *stronger evidence on AI R&D acceleration is possible but requires much more information.*
AI co system cards/risk reports are fine and all, but third-party risk assessments are clearly way more trustworthy. Very thoughtful work by METR.
Important work! And part of a trend towards AI risk assessments being periodic, focused on all frontier models of companies, rather than just happening before a new model is deployed.

