
GoodfireAI uses parameter decomposition to disable a model's German text prediction by fine-tuning on just four tokens

The method modifies weight matrices directly rather than manipulating activations.



Original post

Lucas Beyer (bl16) @giffmana

That's pretty cool! They decompose the weight matrices into interpretable subsets (that part needs a lot more than 4 tokens iiuc) and then basically tune down the subset for German, which destroys German performance while keeping almost everything else intact (unlike lora)

Goodfire@GoodfireAI

We removed an LM's ability to speak German by fine-tuning on only 4 German tokens.

As part of a 1-day hackathon with our product Silico, we removed a 67M-parameter language model's ability to predict German text, by tuning only a scalar factor on one subcomponent of the weights. (1/6)

12:33 PM · Jun 25, 2026 · 20K Views
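A minimal sketch of what "tuning only a scalar factor on one subcomponent of the weights" could look like. It assumes (the thread does not state this) that a weight matrix is a sum of rank-one components `u_c v_c^T`. All names and numbers are hypothetical toy values, and the objective used here (shrinking the output on one target input) is a stand-in, not Goodfire's actual training loss.

```python
# Toy illustration only: a 2x2 "weight matrix" built from two rank-one
# components. Component 1 is the hypothetical one we want to turn down.
components = [
    ([1.0, 2.0], [1.0, 0.0]),   # (u_0, v_0): reads input dimension 0
    ([3.0, -1.0], [0.0, 1.0]),  # (u_1, v_1): reads input dimension 1
]
target_input = [0.0, 1.0]  # stand-in for the behaviour to remove
other_input = [1.0, 0.0]   # stand-in for everything else


def assemble(components, scales):
    """Return W = sum_c scales[c] * outer(u_c, v_c)."""
    rows, cols = len(components[0][0]), len(components[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), s in zip(components, scales):
        for i in range(rows):
            for j in range(cols):
                W[i][j] += s * u[i] * v[j]
    return W


def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def target_energy(scale):
    """Squared output norm on the target input when component 1 is scaled."""
    W = assemble(components, [1.0, scale])
    return sum(y * y for y in matvec(W, target_input))


def tune_scale(scale=1.0, lr=0.02, steps=200, eps=1e-4):
    """Gradient descent on the single scalar, using a central difference."""
    for _ in range(steps):
        grad = (target_energy(scale + eps) - target_energy(scale - eps)) / (2 * eps)
        scale -= lr * grad
    return scale


tuned_scale = tune_scale()
edited_W = assemble(components, [1.0, tuned_scale])
print("tuned scale:", tuned_scale)
print("output on target input:", matvec(edited_W, target_input))
print("output on other input:", matvec(edited_W, other_input))
```

Only one number is trained, so in this toy setup the edit cannot touch anything the chosen component does not contribute to. Whether real components separate this cleanly depends on the quality of the decomposition.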

Sentiment

Positive commenters praise GoodfireAI's minimal edit for erasing an LLM's German ability as a cool, precise technique, while negative commenters fear it will make targeted censorship far easier.

Positive: 52.2%
Negative: 47.8%

20 comments with sentiment.


Posts from X

Most Activity

Views: 21.3K · Bookmarks: 30 · Likes: 320 · Replies: 8

kache@yacineMTB

now remove its ability to speak cuda


Goodfire@GoodfireAI

This is an early demo of how parameter decomposition could enable targeted, predictable model editing.

Details on this experiment: https://www.lesswrong.com/posts/ieoWstubDQWLrMnhH/exploration-fine-tuning-with-parameter-decomposition

If you want to run experiments on your model too, learn more and request access to Silico: https://www.goodfire.ai/silico


Goodfire@GoodfireAI

This was an early exploration in fine-tuning with *parameter decomposition* (see quote), our method which divides a model's weight matrices into interpretable, sparsely-activating components.

We picked German as it seemed to be the model's strongest non-English language. (2/6)


Goodfire@GoodfireAI

We benchmarked vs LoRA fine-tuning. Our edit matched its German removal with far fewer tokens.

Strikingly, it also left other languages almost untouched.

The LoRAs often wrecked French, Spanish, Italian, and sometimes English, while our edit mostly left them alone. (3/6)
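One way such a comparison of on-target and off-target effects could be computed is the per-language change in mean negative log-likelihood (in nats) before and after an edit. This is a hedged sketch: the function names are hypothetical, and the probabilities are invented placeholders chosen to make the arithmetic easy, not results from the experiment.

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood (nats) of the correct-token probabilities."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_probs))


def loss_deltas(before, after):
    """Per-language increase in mean NLL caused by the edit."""
    return {lang: mean_nll(after[lang]) - mean_nll(before[lang]) for lang in before}


# PLACEHOLDER values: probability the model assigns to each correct token.
before = {"German": [0.5, 0.25], "French": [0.5, 0.5]}
after = {"German": [0.125, 0.0625], "French": [0.5, 0.5]}

deltas = loss_deltas(before, after)
for lang, delta in deltas.items():
    print(f"{lang}: loss change {delta:+.3f} nats")
```

A large positive change for the target language together with changes near zero elsewhere would correspond to the "mostly left them alone" pattern described above. The same function could be applied to any additional held-out language.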


Luca Soldaini 🎀@soldni

I know of 3 tokens that will remove the ability to speak ANY language

torch.nn.init


Goodfire@GoodfireAI

@ZikuD_s It is! https://github.com/goodfire-ai/param-decomp


Goodfire@GoodfireAI

In a sense this is cheating: we're indirectly exploiting the tokens from when we did the parameter decomposition and interpreted the resulting subcomponents.

But if our decomposition is good, that cost can be amortized over arbitrarily many tasks & component edits. (4/6)


Goodfire@GoodfireAI

Plus, that interpretability lets us notice and fix problems.

E.g.: initially we tuned the top 16 German-related components, but their labels showed most were about foreign languages in general.

So we narrowed to the single component for German alone, improving precision. (5/6)
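The narrowing step can be pictured as two filters: first take the top-scoring candidates, then keep only those whose labels a reviewer judged to be about German specifically. This sketch is purely illustrative. The `Component` fields, labels, and scores are hypothetical and are not taken from Silico or the param-decomp code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    index: int
    label: str             # hypothetical human-readable interpretation
    german_score: float    # hypothetical relevance score for German text
    german_specific: bool  # reviewer's judgment after reading the label


# Invented examples; a real decomposition would have many more components.
candidates = [
    Component(0, "German function words", 0.9, True),
    Component(1, "non-English European text", 0.8, False),
    Component(2, "foreign-language proper nouns", 0.7, False),
]


def top_k(components, k):
    """Broad selection: the k components with the highest German score."""
    return sorted(components, key=lambda c: c.german_score, reverse=True)[:k]


def narrow(components):
    """Precise selection: keep only components labelled as German-specific."""
    return [c for c in components if c.german_specific]


broad = top_k(candidates, 16)
precise = narrow(broad)
print("broad selection:", [c.label for c in broad])
print("narrowed selection:", [c.label for c in precise])
```

The point of the second filter is that a high relevance score alone does not distinguish "German" from "foreign languages in general". Reading the labels does.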


Goodfire@GoodfireAI

Correction: a plotting error caused the bars in the plot of off-target effects to display at 0.01 nats above the true means. The corrected plot is below:


Hisku@ZikuD_s

@GoodfireAI Very cool research! Wondering if the code is open source 🤔


Delip Rao e/σ@deliprao

@soldni say it with an Essex accent: torch.nn.innit


Piyush@CatAstro_Piyush

@GoodfireAI Super cool work!


Patrick Helmig@phelmig

@GoodfireAI Tja. (Most important token in the process)


Gregor@bygregorr

@GoodfireAI not sure '4 tokens on one subcomponent' = clean removal at 67M. German and Dutch share enough latent subspace that suppressing one usually bleeds into the other. Did you check Dutch or Swedish perplexity post-tune?


Michał Piszczek@cdiamond

@GoodfireAI if one scalar deletes German, the same lever quietly deletes refusals. interp cuts both ways


michi@ichrenndochnur

@GoodfireAI should've removed french


Yong Zheng-Xin@yong_zhengxin

@GoodfireAI this is super cool!


janbam@janbamjan

@GoodfireAI @runaway_vol ach du liebe güte (German for "oh my goodness")


Tanmoy Mukherjee@langer_han

@GoodfireAI Hey @GoodfireAI I do want to run some benchmarks on parameter decomposition. I did apply for Silico but I haven't received any info yet. By any chance could you upload the code on GitHub/HF, or perhaps make something that makes access easier?


bioslopper@bioslopper

I feel like people don't understand how bad this is. Censoring models in post-training usually meant suppressing certain topic-related activations, but it was typically fairly easy to revert those changes and thus uncensor the models.

If you can now decompose the model's weights into knowledge subsets and surgically remove certain knowledge clusters, this will make model liberation impossible.
