For what it's worth, I think this falls into the realm of disagreements that have a strategic but not a moral dimension, and I hope it is a domain of productive debate. But I really do think your view on this is wrong: a policy regime without FLOP thresholds is not automatically impractically onerous. I further think such a regime is necessary and inevitable.
There are simply going to turn out to be standards of reasonableness for which models are acceptable to proliferate, how someone could reasonably be expected to know that, and how to test for it. People will not (and should not) have to put less-capable models through onerous risk review. Many of the needed checks might even be fully automatable. Suppose someone makes or uses a model early on without putting it through review, and their continued use of it is later challenged (e.g. they are sued for using a model that someone claims should have been tested before being put into service). They could refute the challenge by running a cheap automated test and showing that the model had been fine to use in the first place.
If the argument is something like "well, these tests are onerous today," I still don't think that's a good basis for a policy threshold that only weakly correlates with the metric of interest. FLOPs tell me *some* things about models, but really not very much. To reach for a (somewhat hyperbolic) analogy: it's like trying to measure someone's strength by their calorie consumption. Bodybuilders will have higher calorie consumption, but so will a lot of other people who are not particularly strong. Would the metric weakly correlate? Yes. Is it a strong enough discriminator to tell what's what? No.
I wrote this almost two years ago now... https://www.transformernews.ai/p/abandon-compute-thresholds-at-your
