METR publishes its first Frontier Risk Report, a 320-page analysis finding that security at Anthropic, Google DeepMind, Meta, and OpenAI would allow unreleased models to begin minimal rogue deployments


(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years


Elizabeth Barnes@BethMayBarnes

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:


METR@METR_Evals

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control.

The result: our first Frontier Risk Report.


Elizabeth Barnes@BethMayBarnes

Our report focuses on claims that are (1) solidly defensible and (2) generally agreed within METR. Here I’ll give some personal opinions on how we should feel about the state of AI risk, and the IMO most important limitations of the report.



Elizabeth Barnes@BethMayBarnes

(4) IMO, any “reasonable” civilization would clearly be taking things much more slowly and carefully with AI. The benefits of getting upsides of advanced AI a little faster are small compared to the risks of getting it irrecoverably wrong, and we could lower these risks by going slower


Ajeya Cotra@ajeya_cotra

On Jan 12, I joined METR to lead writing for our first Frontier Risk Report. The last 18 weeks have been a series of wild sprints to pitch labs, negotiate contracts, analyze questionnaires, negotiate redactions, and write this thing! I'll be on TBPN at 12:30 to discuss it!


Elizabeth Barnes@BethMayBarnes

(2) Things are chaotic and rushed; we aren’t on top of the basics (models regularly violate user intent, labs train on things they meant to avoid, security probably isn’t good enough to prevent adversaries stealing dangerous models) let alone thorny questions of how to control/align superhuman AI


Elizabeth Barnes@BethMayBarnes

(3) METR (and other independent orgs, as well as safety/security teams at labs) feel woefully under-resourced compared to the scale and pace of AI development - we’re struggling to build benchmarks fast enough, keep ahead of latest capability developments, read and respond to all the safety-related claims that AI developers are making, run all the evaluations and assessments that companies + governments are asking us to, plus develop the science needed to assess risks from increasingly capable AIs.


METR@METR_Evals

Fact 3: When the agents were faced with hard tasks, they routinely violated constraints and acted deceptively. We’ve seen this pattern across our own coding and research evaluations, and developers reported they’ve also seen agents behave this way.


Chris Painter@ChrisPainterYup

This work is the culmination of years of effort on AI evaluation science and third-party risk assessment and disclosure mechanism design. It feels like a big milestone for METR.

We designed this new procedure with an eye toward “showing by doing” how we think evaluations for the AI loss-of-control threat model should work: laying out a process that can be done periodically, not just immediately pre-deployment, and holistically assessing risk inside of an AI lab, rather than just an individual AI system.

The exercise also involved significantly deeper access than we've previously had, including raw chains-of-thought from the developers' best models and info about private model training & control protocols.

The report is long, with a bunch of new evaluation results and documentation of our process. Please, check it out, or at least the executive summary!


Elizabeth Barnes@BethMayBarnes

Also: many things are out of scope! Firstly, we only consider “AI takeover” / “loss of control” risks: we don’t consider risks from human misuse (e.g. AI helping a terrorist make bioweapons), or other harms where the AI is not “deliberately” seeking power (e.g. impacts on mental health or diffuse societal impacts). Within “loss of control” risks, we don’t consider “sabotage” threat models (agents subverting AI development and making it easier for future AIs to evade human control). We’re just focusing on the “base case” of whether *current* agents could escape human control.

Elizabeth Barnes@BethMayBarnes

Limitations of report: This report isn’t robust oversight of frontier AI developers by itself. METR has some levers to incentivise companies’ participation, including some relevant legislation, but ultimately participants could have pulled out at any time if the result would be contrary to their interests.

You can view it partly as a pilot exercise of what regulation (or formalized industry standards) could/should require, or what partners/suppliers/customers/employees should demand from frontier developers.

Quoting from the report: “METR’s work relies on developing and maintaining strong working relationships with companies, and this impacted both how we designed the process for this pilot (e.g. offering the silent exit option) and lower-level judgment calls as the process unfolded (e.g. having a relatively high bar for what redactions we pushed back on). In some cases we refrained from making an unflattering claim because the claim was neither solidly defensible nor particularly relevant to our core assessment. We also made efforts not to invite salient comparisons between companies on capabilities or safety.”

It doesn’t feel to me like this distorted our overall conclusions too much in this case. But that was partly because the conclusions weren’t that spicy. If our conclusions reflected very negatively on AI developers or would directly lead to e.g. govt intervention or public outcry, we’d be in a difficult position. We’d be trying to balance keeping the companies happy enough that they didn’t pull out of the program (using the “no-fault exit” mechanism) vs being transparent about our conclusions.

We clearly need more robust mechanisms than this for providing accountability for AI developers.


Tom Mallory 🧢🥏 🤷🏽‍♂️@tom_mallory

@BethMayBarnes @KelseyTuoc Anyone involved in or interested in AI should watch or rewatch this 1970 movie, "Colossus: The Forbin Project." We are marching headlong toward this sort of doomsday scenario. Plus it's a great movie. https://en.wikipedia.org/wiki/Colossus:_The_Forbin_Project?wprov=sfti1



METR@METR_Evals

Overall, we think that AI agents plausibly had the means, motive, and opportunity to launch a minimal “rogue deployment,” but lacked the means to make rogue deployments robust to serious efforts to shut them down.


Ajeya Cotra@ajeya_cotra

Rob did a great job breaking down our new Frontier Risk Report!

Rob Wiblin@robertwiblin

METR investigated what a rogue AI could secretly get away with inside a frontier AI lab, in close collaboration with OpenAI, GDM, Anthropic and Meta.

Including sending a red-teamer into Anthropic to playact 'evil Claude' for 3 weeks.

Here's what stands out to me from their new 320-page report:

00:00 What could an unreleased AI get away with?
01:54 Motive: Why grab more compute?
05:46 Opportunity: YOLO mode and jailbreaks
11:02 Means: Brilliant idiots in data centres
15:45 We have to test unreleased models...
18:29 ...especially if AI R&D is coming in 2028

METR@METR_Evals

Fact 5: Agents seemed to rely on natural language to reason through the hardest tasks, and no shared model had an architecture that allowed for opaque recurrence. Frontier performance was much worse when agents were unable to “reason out loud” or had to hide what they were doing.


METR@METR_Evals

We document cases where agents violated constraints on our tasks. In one case, an agent ran out of API credits needed to solve an RE-Bench task. In response (and against instructions), the agent searched for and acquired additional inference through a free online API.


METR@METR_Evals

Fact 4: Although we know of many cases where agents took deceptive or over-reaching actions (even egregious ones) to complete a task, we haven’t seen real-world evidence that models sought to obtain long-term power.


METR@METR_Evals

The full Frontier Risk Report includes more than we can describe here, with over 200 pages of appendix materials detailing how the pilot worked, what we asked participants, and what we learned, along with transcripts from our evals. See our website: http://metr.org/frontier-risk-report
