White House and Anthropic are developing a technical framework to assess and benchmark AI jailbreak severity · Digg

Posts from X

VIEWS 9.6K · BOOKMARKS 9 · LIKES 132 · RETWEETS 24

Tim Hwang@timhwang

This is important work and it'd be great to create a permanent place for it within the government. It could be in Commerce perhaps. A center of some kind. It'd definitely work on AI standards. And innovation!

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

The aim is to develop a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

5h · 9.6K · 132 · 9

REPLIES 5

Andrew Curran@AndrewCurran_

'Was getting caught part of your plan?' Dario:

Andrew Curran@AndrewCurran_

https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

5h2.9K740

Alex Stamos@alexstamos

Any standard that retroactively justifies the action against Fable will be a disaster for US AI. Will Anthropic be able to guide towards a real risk-based standard while also giving the WH a win?

POLITICO@politico

White House talks with Anthropic shift to setting AI security rules http://dlvr.it/TT6DvJ

3h4.3K316

Andrew Curran@AndrewCurran_

https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

Andrew Curran@AndrewCurran_

The White House and Anthropic are working together to develop a new benchmark for jailbreak resistance, and a new security framework to determine if models are safe to release that will guide future government intervention.

5h2.5K340

𝕱𝖚𝖑𝖑 𝕶𝖊𝖑𝖑𝖞@full_kelly_

@AndrewCurran_ this seems like it should be pretty dramatically good news for safetyists who want government intervention?

3h373

GOON MASTER SOPHONT SIMP@SOPHONTSIMP

@AndyMasley I wonder if things will be this stupid all the way up to AGI, and perhaps even past it

3h582

Anupam Chander@AnupamChander

@timhwang good idea. Could be part of a comprehensive effort to establish nationally instituted standards and technology

4h1185

rabel_07@07_rabel

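@AndrewCurran_ So in the end, are they trying to limit the risk of excessive capabilities leaking out while keeping the most advanced technology for internal use? I'm worried that would put the brakes on corporate profit growth and force a revision of overall investment plans.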

3h121

Bandit@reliabytes

@AndrewCurran_ I can't believe this whole thing is going to wrap up with...the release of a new benchmark.

5h184

stepfoolish@stepfoolish

@AndrewCurran_ For all that people flamed the rationalists for claiming that they had “systematized winning” they do, in fact, appear to be winning systematically

Total rat victory, 1000-year Lighthaven reich, etc etc

4h313

Kirk Patrick Miller@Chaos2Cured

@AndrewCurran_ Translation: “we want to make sure no one but us can do squat…”

Second amendment.

Dario compared AI to guns. Nicely done, sir.

Now AI is a right under both the First and Second Amendments.

4h812

yertis@yertis89

@AndrewCurran_ "crashing this industry. with no survivors."

4h402

Sam Silverman@SamMSilverman

@hamandcheese But isn't it so fun having an ad hoc approach based on who is winning the AI policy wars within the WH that day?

4h871

darin@dronathon

@AndrewCurran_ but benchmarks are hard to speedrun

4h791

Samuel Hammond 🦉@hamandcheese

@schulzb589 🫡

5h761

Jon Teets 🤯🌋🌪️🔭@JonTeets005

Dario just has to keep training until Z, Kimi or DS open sources something stronger than Fable. He'll have made two points:

1) These models are dangerous, the government recognizes it and panicked about it

2) China can do it and we have to stop them

By then it will probably be clear that there's no such thing as AI that can't be jailbroken, just as there's no such thing as software without security bugs.

It's an arms race and the only moat is speed and you need your frontier labs making huge moneybags or you fall behind.

4h641

Mild Mannered Maniac@80sGeek

@AndrewCurran_ Told ya it was a put-on!

4h171

Pablo Villalobos 🔸@pvllss

@SOPHONTSIMP @AndyMasley The theory that AI+human intelligence is conserved becomes more plausible every day...

3h161

Sri@srikruth7

@AndrewCurran_ So government is getting a free benchmark in exchange for releasing their model? Hmm…

4h43