Expert Warns AI Executive Order Cyber Threshold May Cover All Models

Original post
Divyansh Kaushik @dkaushik96 · #849 in AI

Just on cyber first, the EO is a good step but I’m a little worried about the benchmarking that NSA has to do to define a covered model by cyber capability. Say we set the bar at X, calibrated to today’s defenses. In 6-12 months X catches every model released (as defenses lag but other labs release capable models). Yes, X moves up eventually but in the near term we risk making every model covered. We may end up building a threshold that is no longer useful.

Divyansh Kaushik @dkaushik96

I fear some people still haven’t registered that Mythos/Mythos+ models aren’t cyber models but broadly capable. The cyber focus is much warranted (in fact long overdue) but we’re ignoring so much else. And I often worry that comes back to haunt us 6-12 months from now.

4:49 AM · Jun 9, 2026 · 421 Views
Posts from X
Divyansh Kaushik @dkaushik96

We’re going to be having the same conversations we had wrt cyber on bio, on autonomy/R&D/RSI, etc in the next few months—all while major labs are looking to go public. We need a lot more agility in governance and a lot more visibility to know what to govern.

Divyansh Kaushik @dkaushik96

This just gets to how hard it is to design a mechanism to address just one kind of threat. I go back to my original worry that most policymakers aren’t registering what’s truly happening here (which is exactly why institutions like CAISI are all the more important to give us visibility into what to prepare for).

4h · Views 1.5K · Likes 6 · Bookmarks 1
Divyansh Kaushik @dkaushik96

And then the issue that this threshold is per-model. Microsoft’s MDASH found more vulnerabilities than any single frontier model by orchestrating hundreds of smaller ones together but none of those would be covered individually. So we leave a major aspect of the threat space uncovered.

4h · Views 262 · Likes 1 · Bookmarks 1
Divyansh Kaushik @dkaushik96

What’s going to be our response on bio or autonomy? Are we willing to make hard decisions? We struggle with decision making already, and it’s getting harder to get things right in the first go.

Things will get weird and only about 200 or so people in DC are paying attention.

3h · Views 794 · Likes 5 · Bookmarks 0
Divyansh Kaushik @dkaushik96

All that said, a model’s capability scales with how much test-time compute you give it. Hand a weaker, uncovered model far more compute and it may find exploits the benchmark never priced in. So what are we actually measuring? The line moves the moment someone spends more.

4h · Views 116 · Likes 0 · Bookmarks 0