Expert Warns AI Executive Order Cyber Threshold May Cover All Models

Original post
Divyansh Kaushik @dkaushik96

Just on cyber first, the EO is a good step but I’m a little worried about the benchmarking that NSA has to do to define a covered model by cyber capability. Say we set the bar at X, calibrated to today’s defenses. In 6-12 months X catches every model released (as defenses lag but other labs release capable models). Yes, X moves up eventually but in the near term we risk making every model covered. We may end up building a threshold that is no longer useful.

Divyansh Kaushik @dkaushik96

I fear some people still haven’t registered that Mythos/Mythos+ models aren’t cyber models but broadly capable. The cyber focus is much warranted (in fact long overdue) but we’re ignoring so much else. And I often worry that comes back to haunt us 6-12 months from now.

4:49 AM · Jun 9, 2026 · 672 Views
Divyansh Kaushik @dkaushik96

We’re going to be having the same conversations we had wrt cyber on bio, on autonomy/R&D/RSI, etc in the next few months—all while major labs are looking to go public. We need a lot more agility in governance and a lot more visibility to know what to govern.

Divyansh Kaushik @dkaushik96

This just gets to how hard it is to design a mechanism to address just one kind of threat. I go back to my original worry that most policymakers aren’t registering what’s truly happening here (which is exactly why institutions like CAISI are all the more important to give us visibility into what to prepare for).

Divyansh Kaushik @dkaushik96

What’s going to be our response on bio or autonomy? Are we willing to make hard decisions? We struggle with decision making already, and it’s getting harder to get things right in the first go.

Things will get weird and only about 200 or so people in DC are paying attention.

Divyansh Kaushik @dkaushik96

And then the issue that this threshold is per-model. Microsoft’s MDASH found more vulnerabilities than any single frontier model by orchestrating hundreds of smaller ones together but none of those would be covered individually. So we leave a major aspect of the threat space uncovered.

Divyansh Kaushik @dkaushik96

All that said, a model’s capability scales with how much test-time compute you give it. Hand a weaker, uncovered model far more compute and it may find exploits the benchmark never priced in. So what are we actually measuring? The line moves the moment someone spends more.
