Nous Research releases Contrastive Neuron Attribution (CNA), a method that steers LLM behavior by ablating the top 0.1% of MLP neurons, identified via contrastive prompts, without changing model weights

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This will be huge for roleplay

Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks.

Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade.

Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.

The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.

Aryaman Arora@aryaman2020

nice, additional evidence towards claims in our work! MLP neurons are pretty good https://arxiv.org/abs/2601.22594

Jiaxin Wen@jiaxinwen22

> To confirm that CNA ablation does not degrade general model capabilities, we evaluate *MMLU* accuracy

🐸

Nous Research@NousResearch

What we find most useful about CNA is that the intervention is simple yet powerful. The steering is a multiplicative ablation on a sparse set of MLP neurons, which makes CNA a clean addition on top of standard production pipelines: instruction tuning, RL, and safety post-training. We find the neuron basis to be ripe for further exploration in interpretability and steering domains.

Paper: https://arxiv.org/abs/2605.12290
Blog: https://nousresearch.com/neuron-steering
Code: https://github.com/NousResearch/neural-steering
HF: https://huggingface.co/papers/2605.12290

CNA is a contribution from the mechanistic interpretability team at Nous. If you want to work on problems like this, find us on Discord.
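For readers who want to see what a "multiplicative ablation on a sparse set of MLP neurons" could look like, here is a minimal NumPy sketch. It is not the code from the linked repository: the function name `ablate_neurons`, the `scale` parameter, and the convention that `scale=0.0` means full ablation are all assumptions made for illustration. In a real model, something like this would presumably run inside a forward hook on the MLP, leaving the weights untouched.

```python
import numpy as np

def ablate_neurons(mlp_acts, neuron_idx, scale=0.0):
    """Multiply selected neuron activations by `scale`; leave the rest alone.

    mlp_acts:   array of shape (..., n_neurons), hypothetical MLP activations
    neuron_idx: indices of the sparse neuron set to steer
    scale:      assumed convention: 0.0 is full ablation, 1.0 is a no-op
    """
    out = mlp_acts.copy()          # never modify the caller's activations
    out[..., neuron_idx] *= scale  # only the selected neurons are rescaled
    return out

# Toy activations: 2 token positions, 6 neurons.
acts = np.arange(1.0, 13.0).reshape(2, 6)
steered = ablate_neurons(acts, np.array([1, 4]), scale=0.0)
print(steered)
```

Because the intervention only rescales a handful of activations at inference time, every other neuron passes through unchanged, which is what would make it easy to layer on top of an already-trained model.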

Nous Research@NousResearch

Running the same CNA search on base models (before instruction tuning) yields a structurally similar set of neurons, but ablating them produces almost no behavioral change.

We read this as evidence that the refusal mechanism is not latent in the pretrained model. The structural substrate is there, but alignment fine-tuning is what wires it up as a behavioral gate.
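The thread does not say how "structurally similar" was measured. One simple way to compare the neuron sets found in a base model and its instruct-tuned counterpart would be a Jaccard overlap of the selected indices; the sketch below assumes that metric and uses made-up index lists, so treat it as an illustration rather than the paper's procedure.

```python
def jaccard_overlap(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical neuron indices, not taken from any real model.
base_neurons = [7, 1234, 2500, 3999]
instruct_neurons = [7, 1234, 2500, 3100]
overlap = jaccard_overlap(base_neurons, instruct_neurons)
print(overlap)  # 3 shared / 5 total = 0.6
```

A high overlap combined with little behavioral change under ablation in the base model would match the pattern described above: the same neurons are present, but they do not yet act as a gate.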

Nous Research@NousResearch

CNA builds on contrastive-pair methods like CAA, where residual-stream activation differences are averaged between paired positive and negative prompts to construct a tunable control vector. CAA works at low steering strengths but degrades at high ones, where the residual-stream intervention begins to corrupt outputs unrelated to the target behavior.
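As a rough illustration of the CAA baseline described above, the sketch below averages residual-stream differences over paired prompts and adds the scaled result back in. The array shapes, function names, and toy data are assumptions; real CAA implementations pick a specific layer and token position, which this sketch leaves out.

```python
import numpy as np

def caa_steering_vector(pos_acts, neg_acts):
    """Mean residual-stream difference over paired prompts.

    pos_acts, neg_acts: shape (n_pairs, d_model); row i of each array is
    assumed to come from the same contrastive pair.
    """
    return (pos_acts - neg_acts).mean(axis=0)

def apply_caa(resid, vector, strength):
    """Add the scaled control vector to every residual-stream position."""
    return resid + strength * vector

# Toy data: the "behavior" lives along dimension 2 of an 8-dim stream.
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[2] = 1.0
neg = rng.normal(size=(16, d_model))
pos = neg + 3.0 * direction

vec = caa_steering_vector(pos, neg)
steered = apply_caa(np.zeros((4, d_model)), vec, strength=-2.0)
print(vec.round(3))
```

Note that the vector is dense: at high `strength` it shifts the whole residual stream at every position, which is consistent with the degradation on unrelated outputs mentioned above.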

CNA moves the contrastive comparison out of the residual stream and into the MLP neuron basis. For each neuron we record its down-projection activation at the last token across both prompt sets, then keep the top 0.1% by mean contrastive difference. We find that ablating that set reduces refusal rates at similar strengths where CAA would corrupt unrelated outputs.
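The selection step above can be sketched in a few lines of NumPy. This is a hedged reconstruction, not the released code: it assumes the last-token activations have already been collected into `(n_prompts, n_neurons)` arrays (flattened across layers), and it ranks neurons by the absolute value of the mean difference, which is one reading of "mean contrastive difference". The function name and the toy data are hypothetical.

```python
import numpy as np

def select_contrastive_neurons(pos_acts, neg_acts, fraction=0.001):
    """Return sorted indices of the neurons whose mean activation differs
    most between the two prompt sets.

    pos_acts, neg_acts: shape (n_prompts, n_neurons), assumed to hold each
    neuron's last-token activation for every prompt in the set.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    k = max(1, int(round(fraction * diff.size)))  # top 0.1% by default
    # argpartition finds the k largest |diff| values without a full sort.
    top = np.argpartition(np.abs(diff), -k)[-k:]
    return np.sort(top)

# Toy data: 4000 neurons, four of which respond to the target behavior.
rng = np.random.default_rng(0)
n_prompts, n_neurons = 32, 4000
planted = [7, 1234, 2500, 3999]
neg_acts = rng.normal(size=(n_prompts, n_neurons))
pos_acts = rng.normal(size=(n_prompts, n_neurons))
pos_acts[:, planted] += 10.0

selected = select_contrastive_neurons(pos_acts, neg_acts)
print(selected)  # the four planted indices
```

With 4000 neurons, 0.1% is just four indices, which gives a sense of how sparse the steered set is compared with a dense residual-stream vector.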

Nous Research@NousResearch

To check that CNA isolates only the intended behavior, we evaluate steered models on MMLU across a range of steering strengths. CAA-steered models lose MMLU performance as strength increases; CNA-steered models match the unsteered baseline at every strength tested.

We read this as evidence that the CNA-identified circuit is doing only what we intended: removing a specific behavior, with no measurable spillover into unrelated capabilities.

Pano Pouroullis@Pano_Pouroullis

@NousResearch Lovely paper … sat a session with chat and Claude and made some learning notes :)

Great research @NousResearch 🙏

nightwing@yaboilyrical

@aryaman2020 we were very inspired by the work you’ve done! very glad you appreciate 🙂

Akshobya@albustime

@NousResearch @grok why would we want to isolate the top 0.1% of My Little Pony Neurons? Is it because they are all inside of AI researchers

Jiaxin Wen@jiaxinwen22

@NousResearch obviously steering degrades quality

Justin Brooke ❤️‍🔥@IMJustinBrooke

@NousResearch @grok translate this to good ol’ country boy slang for me so I can smell what these Nous boys be cookin’ up.

nightwing@yaboilyrical

@jiaxinwen22 @NousResearch we specify in both the blog and paper that ‘quality’ here refers to avoiding mode collapse / repetitive sampling, though i agree this output is lower quality and this is an interesting finding, thanks for sharing!

Jonathan Vitela@JKVitela1

@NousResearch Something I'd like to see, coming from someone who is currently studying this area, is a small paragraph summing up everything contained in this post but explained in plain English, so that someone like myself could fully understand. Great post though! I am not afraid to admit IDK.

Jiaxin Wen@jiaxinwen22

@yaboilyrical @NousResearch > avoiding mode collapse / repetitive sampling

i think we both agree that this is not the goal for production-level methods

Alpha Trader 🧑‍💻@0xAlphaTrader

@NousResearch Nous doesn't stop cooking

snav@qorprate

@JKVitela1 @NousResearch Blog post should hopefully help, feel free to ask if there's still questions! https://nousresearch.com/neuron-steering

nightwing@yaboilyrical

@jiaxinwen22 @NousResearch our claim here is that this method outperforms classical control vector steering (like CAA) on that basis. while SAE-based steering *might* provide better overall quality it's quite a bit more expensive to find features to steer on.

both are specified in detail in our paper

Dill@dillonrolnick

@NousResearch nice work mr. nightwing! @yaboilyrical @karan4d @qorprate

Eren Suner@geren8te

@NousResearch this is the useful direction. steering behavior by isolating the circuit is way more practical than pretending every fix needs another giant fine-tune.
