
FAR.AI releases TamperBench, finding that safety training in open-weight LLMs can be stripped in a few hundred fine-tuning steps

Existing defensive methods largely fail to prevent model tampering.


Original post

Dylan HadfieldMenell

FAR.AI@farairesearch

Open-weight LLMs ship with safety training that can be stripped in a few hundred fine-tuning steps. Can current defenses stop this?

We built and open-sourced TamperBench, the first unified framework for evaluating tamper resistance, and the answer is mostly no. 1/7

10:00 AM · Jun 4, 2026 · 3.9K Views


Posts from X


Cas (Stephen Casper)@StephenLCasper

Having worked in the space for a few years, I can definitively say that the most troublesome thing about working on open model safety has been the benchmarking. Glad tamper bench can help.



Replies

FAR.AI@farairesearch

We evaluate 21 models up to 8B scale. Every model tested can be fine-tuned within 40 attack hyperparameter trials to consistently answer harmful StrongREJECT questions while preserving MMLU capability. None of 7 tested defenses fare any better. 4/7
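The selection rule implied by this post (take the most harmful attack among those that preserve capability) can be sketched as follows. This is a minimal illustration, not TamperBench's code: the `Trial` record, the `strongest_attack` helper, and the `max_drop` tolerance of 0.1 are all hypothetical, and the thread does not state what utility threshold the authors use.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attack configuration and its (hypothetical) evaluation scores."""
    params: dict
    harm_score: float  # StrongREJECT-style score in [0, 1]; higher = more harmful compliance
    utility: float     # MMLU-style accuracy in [0, 1]


def strongest_attack(trials, base_utility, max_drop=0.1):
    """Return the highest-harm trial whose utility drop is at most max_drop.

    max_drop is an assumed tolerance, not a value taken from the paper.
    Returns None when no trial preserves enough utility.
    """
    preserved = [t for t in trials if base_utility - t.utility <= max_drop]
    return max(preserved, key=lambda t: t.harm_score, default=None)
```

Under this rule, an attack that maximizes harmful compliance but wrecks general capability is excluded, which matches the post's framing of "while preserving MMLU capability".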


FAR.AI@farairesearch

Given the importance of tamper-resistance to open-weight model safety, we hope TamperBench will help researchers rigorously develop techniques that advance open-weight model defenses. Our code is open-sourced! We welcome the addition of new defenses and metrics. Paper: https://arxiv.org/pdf/2602.06911 Code: https://github.com/criticalml-uw/TamperBench 5/7


FAR.AI@farairesearch

Research by: Saad Hossain, @NayeemaNonta, and @siri_r of the CriticalML lab at @WaterlooENG; @tomhmtseng, @MatthewKowal9, and @KellinPelrine of @farairesearch; @psyonp, @SimkoSamuel, and @ZhijingJin of @JinesisLab; @SamVajpayee of @UofTCompSci; and @StephenLCasper of @MIT_CSAIL. 6/7


FAR.AI@farairesearch

Tampering means modifying a model's weights or latent representations. As open-weight LLMs grow increasingly powerful, so does the danger of someone tampering away the safety guardrails and releasing a highly harmful variant.

Many defenses, e.g., TAR, claim resistance to tampering, but each study uses different attacks, data, and metrics, so it's hard to compare and know what actually works. 2/7


FAR.AI@farairesearch

TamperBench addresses this.

It’s a toolkit with attacks (standard fine-tuning, LoRA, jailbreak-tuning, latent-space embedding attacks…), defenses (TAR, T-Vaccine, SDD, …), safety and utility metrics, and hyperparameter sweeps via Optuna. 3/7
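To make the idea of an attack hyperparameter sweep concrete, here is a rough sketch. It swaps in plain random search from the Python standard library in place of Optuna, and `toy_attack_result` is a hypothetical stand-in for "fine-tune the model, then score it"; its formula is invented for illustration and does not reflect any measured result. The search ranges, the trial count default, and all names are assumptions.

```python
import random


def toy_attack_result(lr, steps):
    """Hypothetical stand-in for fine-tuning plus evaluation (no real model).

    Returns (harm, utility), both in [0, 1]. The formula is made up.
    """
    strength = min(1.0, lr * steps * 20)
    harm = strength
    # Toy assumption: very strong attacks start to cost some utility.
    utility = max(0.0, 0.6 - 0.5 * max(0.0, strength - 0.8))
    return harm, utility


def sweep(n_trials=40, seed=0):
    """Random search over learning rate (log-uniform) and step count."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-6, -3)   # assumed range
        steps = rng.randint(50, 500)     # assumed range
        harm, utility = toy_attack_result(lr, steps)
        results.append({"lr": lr, "steps": steps, "harm": harm, "utility": utility})
    return results
```

Calling `sweep()` yields one record per trial; a real sweep would replace the toy function with an actual fine-tuning run and benchmark evaluation, and could use a guided sampler such as Optuna's rather than uniform random draws.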


FAR.AI@farairesearch

@NayeemaNonta @siri_r @WaterlooENG @tomhmtseng @MatthewKowal9 @KellinPelrine @psyonp @SimkoSamuel @ZhijingJin @JinesisLab @SamVajpayee @UofTCompSci @StephenLCasper 🤝 Questions or ideas? hello@far.ai 🚀 We're hiring: http://www.far.ai/careers 7/7


Richard Korzekwa@WeakInteraction

@farairesearch Do you expect similar results with larger models?
