Hugging Face CEO Clément Delangue argues post-training API guardrails are inadequate, proposing pre-training evaluations and staged releases

He warns brittle interfaces are easily bypassed through jailbreaks

Original post

clem 🤗@ClementDelangue

Let’s face it: after-the-fact API guardrails are not the right safety tool for frontier models.

They don’t make dangerous capabilities disappear. They just hide them behind a brittle interface that can be easily jailbroken.

A better safety agenda:

- don’t train models for very high-risk capabilities without strong evals, justification, and containment

- use staged release, as pioneered by @IreneSolaiman, from trusted testers to broader access, and open release for transparency and accountability

- massively support open-source AI so the gap between players does not become so large that a few closed labs and governments end up with overwhelming capabilities and power over everyone else

- enable independent evaluation instead of asking everyone to trust a black-box API

- give law enforcement, courts, regulators, auditors, journalists, and civil society strong AI tools to detect, investigate, and hold accountable unlawful uses of AI

Safety means transparency, staged deployment, distributed power, and making sure democratic institutions can actually enforce the law.

2:39 PM · Jun 18, 2026 · 7.2K Views

Sentiment

Many users agree with the Hugging Face CEO's push for staged releases and open-source methods to improve AI safety, while others criticize API guardrails as ineffective and brittle.

Positive: 62.5%

Negative: 37.5%

8 comments with sentiment.

Posts from X

Yong Zheng-Xin@yong_zhengxin

> don’t train models for very high-risk capabilities without strong evals, justification, and containment

How is this possible when many high-risk capabilities are dual-use in nature and can emerge from the capabilities users seek? For instance, Mythos cyberoffensive capabilities are emergent from being good at coding.

6h170

Likes: 1

JD | RoyalCities@RoyalCities

@ClementDelangue Couldn't agree with you more, but given the moves Anthropic has been making with the US admin, I really hope Hugging Face has a plan B in case they target you guys and go after all of open source.

7h1051

Replies: 3

Stella Biderman@BlancheMinerva

@yong_zhengxin @ClementDelangue Are they? That’s an open empirical question in my mind that we don’t have any way of knowing the answer to.

4h46

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue That's from the report (tho it is unclear whether they deliberately trained on patching vulnerabilities – from the report and my chat with some folks, it seemed like they didn't).

But clearly exploitation is emergent.

4h34

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue > emergent

There's also this new work (https://arxiv.org/abs/2606.) that shows Kimi, DeepSeek, and Qwen all have agentic vulnerability exploitation ability, though still not as good as closed models.

4h32

Stella Biderman@BlancheMinerva

@yong_zhengxin @ClementDelangue That’s not empirical evidence for the claim. At the most generous it’s a summary of secret empirical evidence by someone with significant financial and political investment in people having a particular belief about the technology.

4h25

Stella Biderman@BlancheMinerva

@yong_zhengxin @ClementDelangue More broadly, I’m yet to see a compelling argument that there exists a technology that is moral to allow a tech company to own, but not moral to allow me to own. If the technology is as dangerous as they claim, they shouldn’t be developing it.

4h22

Allen Schmaltz@Allen_Schmaltz

Yes, 2019-era logistic regression over the hidden-states is not the right approach. The problem is that such guardrails will themselves be brittle to distribution shifts, so the can just gets kicked down the road without actually fixing the problem.

Instead, what we need are robust, interpretable estimators of the predictive uncertainty via constraints in the feature-representation space. Essentially, we need to model language models as metric learners.

Simply put, we need Actionable Interpretability: Introspection of the predictions against instances with known labels; local updatability of conditional-branching decisions without a full re-training (as a mechanism complementary to in-context learning, retrieval, and tool-calling); and robust, approximately conditional estimators of the predictive uncertainty.

9h1631

Bran@Bran_Fi

@ClementDelangue Have to be careful with that too, there is so much mundane knowledge that in the right context could be considered high risk. Overly aggressive or politically aligned evals/containment will decimate generalization.

10h159

Behrad Khodayar@behradkhodayar

@ClementDelangue All r wise recommendations & appreciate it.

But the started game can’t be unstarted.

10h125

Ferbin@Ferbin08

@ClementDelangue Seen this in robotics: train a system for a capability, you can't untrain it. You can wall it off, but walls get found. Only answer that's durable is not building the capability in the first place.

9h53

yash@yashetal

@ClementDelangue These are some clear perspective points for sure on how we should look at AI safety

10h50

Marcus@MarcusSpillane

@ClementDelangue API guardrails are seatbelts on a car with no brakes. We bolt them onto enterprise deployments and watch them snap inside a week. Real safety is training-time alignment and capability scoping. Bandaids don't ship to production.

6h37

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue I think it's actually in the best interest of Anthropic to say that it is not emergent, because this means that they can build a great cyberdefensive model without strong offensive capabilities.

4h28

Saylor@seylorra

@ClementDelangue the real safety question is what they're training in the first place, not how they gate it after

9h23

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue https://arxiv.org/abs/2606.14295

4h18

Charles Foster@CFGeek

@ClementDelangue I agree with most of this!

3h17

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue I think it is also about whether there's enforceable oversight.

4h13

Yong Zheng-Xin@yong_zhengxin

@BlancheMinerva @ClementDelangue my second reread updates me toward thinking they didn't train on patching. It's just from coding, reasoning, and autonomy: all the attributes that newer-gen models are seeking to improve on.

4h11

RTK@RiverKhan

@ClementDelangue agreed! if a model is unsafe, then clearly a lab hasn't done enough evals + hardening. i think the rush to just get something out and fast-track the IPO got to them

8h7