REAP is fascinating. you can find people on huggingface using coding datasets as calibration to prune parts of big MoE models selectively. and the outcome is "it's fine on coding, but if you ask it about who bill clinton is, it has zero preserved knowledge of him whatsoever"
Prime Intellect's kalomaze says REAP pruning with coding datasets strips MoE models of general knowledge
Pruned models forgot basic facts like Bill Clinton's identity
Many users are excited about selectively pruning MoE models with coding datasets because it enables efficient domain adaptation and specialization while fitting large models on local GPUs.
All according to keikaku

btw this is the end result of a well-implemented fine-grained MoE, and all contemporary MoEs mostly inherit their shapes from this paper
and in some cases it will continue to reason... more or less correctly? just from totally broken premises/a fractured map of world knowledge

orthogonal bill clinton subnetwork lobotomy is possible without mech interp's involvement and without deliberate retraining

the catch is that this would produce horrible historian AIs, but if you can context distill the memorization blocker circuit in prompt contexts where the user specifically asks for the knowledge circuits to be shut off... hmm...

in principle if you can harness this on purpose for world knowledge, you could RL models to be good at first principles prediction of stuff *from premises* rather than from *memorized knowledge*
you'd have a perfectly verifiable massive set of things that really happened

@buildingcoolshi https://github.com/CerebrasResearch/reap

by this i mean, you can preserve hillary clinton by accident, not know of bill clinton, and the model does shit like "hm... perhaps bill is related or married to her...? or the user is confused and was remembering hillary...?"

This is the simplest way of explaining it.
I have looked at dozens of benchmarks, error patterns, and thousands of reasoning traces.
I’ve used 200M tokens of REAPs this month, and have a series of tweets showing how they build apps, 3d games, animations, and perform computer use, agentic work.
Basically what you get:
- 1.8x token burn on the same prompts vs the original
- Increase in self-doubt, strange reasoning logic
- If done badly, corruption in ASCII and formatting of tables (I have solved this by calibrating on this kind of thing and performing mod surgery to restitch lost super experts)
- Increased likelihood of looping on topics you didn't preserve for
- More often falls into attractors that cause it to spiral, generating the same token over and over
The argument I like to use:
- companies like zai have good coding training data
- experts have lots of redundancy
- the model often trains on more tokens and is RLed harder than small models (qwen… wtf)
So the 65% of the weights quantized to 4 bits are more valuable than, say, Nemotron Super's bf16 weights.
Often REAPs do 1-5% better on coding/agent benchmarks vs the original, at the cost of -50 points on MMLU, for example.
I also think this is all a compute problem: if you pour in more calibration, self-distillation, recovery LoRA and router training, you can get very close to the original.
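A rough back-of-the-envelope for the "65% of the weights at 4 bits" point above. This is only a sketch: the parameter count is a hypothetical placeholder, and the helper ignores quantization scales, zero points, and any layers kept at higher precision.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (ignores quantization metadata)."""
    return num_params * bits_per_weight / 8 / 2**30

# Hypothetical numbers, chosen only for illustration.
ORIGINAL_PARAMS = 100e9  # assumed size of the unpruned MoE
KEEP_FRACTION = 0.65     # fraction of weights kept after pruning (from the post)

original_bf16 = weight_gib(ORIGINAL_PARAMS, 16)
pruned_4bit = weight_gib(ORIGINAL_PARAMS * KEEP_FRACTION, 4)
ratio = pruned_4bit / original_bf16

print(f"original bf16: {original_bf16:.1f} GiB")
print(f"pruned 4-bit:  {pruned_4bit:.1f} GiB")
print(f"ratio:         {ratio:.4f}")
```

Under these assumptions the pruned, quantized model needs 0.65 × 4/16 ≈ 16% of the original bf16 footprint, which is the memory budget where it ends up competing with much smaller dense models stored in bf16.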

@kalomaze That's probably a lack of regularization. Statistics doesn't like abrupt transitions.
@kalomaze It will know Al Gore though. He invented the internet.

@kalomaze local model hyperspecialization

@kalomaze I get like 5 LLM papers when I search for REAP, which one are you referring to?

@Rafa_Schwinger @kalomaze but REAP is a router-type application. each of the experts is already internally consistent. the idea is to prune the irrelevant experts (to reduce size) for a task that you're fine tuning towards.
ultimately, the REAP process is creating a sample imbalance to force a bias.
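A minimal sketch of the idea in this reply: score each expert on a calibration set, then drop the lowest scorers. It assumes the score is the average of (router gate weight × L2 norm of the expert's output) over the tokens routed to that expert, which is my reading of the REAP criterion; the linked CerebrasResearch/reap repo is the authority on the real implementation. The expert names and the toy trace are hypothetical, and real experts are not labeled by topic.

```python
from collections import defaultdict
from math import sqrt

def reap_style_scores(routed_tokens):
    """Average (gate weight * L2 norm of expert output) per expert.

    routed_tokens: iterable of (expert_id, gate_weight, expert_output_vector),
    one entry per (token, selected expert) pair seen during calibration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in routed_tokens:
        norm = sqrt(sum(v * v for v in output))
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}

def experts_to_keep(scores, all_experts, keep_fraction):
    """Keep the highest-scoring experts; experts never routed to score 0."""
    ranked = sorted(all_experts, key=lambda e: scores.get(e, 0.0), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return set(ranked[:n_keep])

# Toy calibration trace from a coding-only dataset (hypothetical values).
trace = [
    ("code_a", 0.7, [3.0, 4.0]),
    ("code_a", 0.5, [0.0, 2.0]),
    ("code_b", 0.3, [1.0, 0.0]),
    ("facts", 0.1, [0.0, 1.0]),
]
all_experts = ["code_a", "code_b", "facts", "history"]

scores = reap_style_scores(trace)
kept = experts_to_keep(scores, all_experts, keep_fraction=0.5)
print(sorted(kept))
```

In this toy setup the experts that coding calibration rarely or never routes to ("facts", "history") are the ones pruned, which is the sample-imbalance effect described above.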

the most ambitious version of this would be, like, "strip einstein-specific observations encoded as orthogonal knowledge and force it to rederive general relativity", assuming it works, but there's probably *some* degree of dark knowledge (in the Hinton sense) type entanglement

@kalomaze > where the user specifically asks for
or where *you* don't want the user to access some specific knowledge..

@kalomaze This is extremely hype wtf

@kalomaze Benjamin Marie ran evals for REAP. They have similar accuracy on many benchmarks but consume more tokens, i think KV cache will eat the returns for some models. https://kaitchup.substack.com/p/qwopus-and-reap-custom-qwen36-models
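A quick sketch of why extra tokens eat into the savings. It assumes standard attention with a per-layer key/value cache (not MLA or linear-attention variants), a hypothetical model config, and the 1.8x token multiplier reported elsewhere in this thread.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: keys and values, every layer, every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical config, chosen only for illustration.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

baseline_tokens = 10_000
reap_tokens = round(baseline_tokens * 1.8)  # assumed token burn multiplier

base = kv_cache_bytes(baseline_tokens, **cfg)
pruned = kv_cache_bytes(reap_tokens, **cfg)

print(f"baseline KV cache: {base / 2**20:.0f} MiB")
print(f"REAP KV cache:     {pruned / 2**20:.0f} MiB")
print(f"growth:            {pruned / base:.2f}x")
```

Pruning experts shrinks the weights but leaves the per-token KV cost unchanged under these assumptions, so the cache grows linearly with the extra tokens generated.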

that creates an incoherent world-view. It's better if you keep some of the rest, especially domains that are ontologically close to the target. For instance, if you want to create a Python model, you can put mostly a Python dataset, but also a bit of general programming, and then a little bit of STEM, and finally just a little bit of completely general knowledge. that will preserve coherence.
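The mix described above could be sketched like this. The domain names, the 70/15/10/5 weights, and the placeholder pools are all assumptions for illustration; the reply only says "mostly", "a bit", and "a little bit".

```python
import random
from collections import Counter

def build_calibration_mix(pools, weights, n_samples, seed=0):
    """Sample (domain, example) pairs from domain pools with the given weights."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    picks = rng.choices(domains, weights=probs, k=n_samples)
    return [(d, rng.choice(pools[d])) for d in picks]

# Hypothetical pools; in practice each would be a real dataset.
pools = {
    "python": ["py_example_1", "py_example_2"],
    "general_programming": ["prog_example_1"],
    "stem": ["stem_example_1"],
    "general_knowledge": ["gk_example_1"],
}
# Hypothetical weights: mostly Python, tapering off toward general knowledge.
weights = {
    "python": 0.70,
    "general_programming": 0.15,
    "stem": 0.10,
    "general_knowledge": 0.05,
}

mix = build_calibration_mix(pools, weights, n_samples=1000)
counts = Counter(domain for domain, _ in mix)
print(counts.most_common())
```

Sampling with weights rather than concatenating whole datasets keeps the target domain dominant while still giving the router some signal for neighbouring domains.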

@CarnivalHotdog @kalomaze Each expert is internally consistent locally but not globally. Idk, it feels aggressive to me. I believe from first principles that it would be better to finetune it a bit, even if with a lora to be merged in, to smooth out any edges you get in the process

@kalomaze REAP models will be great at any task if that task/domain was targeted when producing the pruned version, but token efficiency is still a problem to be solved in REAP models.

@Rafa_Schwinger @kalomaze what does coding have to do with clinton? i'm sure there's some obama in the coding set though. ie: "learn to code."
it's a mixture of experts, and you're zeroing out the non-discrete logical experts, like a person who's good at math but bad with names and dates.