
Prime Intellect's kalomaze says REAP pruning with coding datasets strips MoE models of general knowledge

Pruned models forgot basic facts like Bill Clinton's identity



Original post

kalomaze@kalomaze

REAP is fascinating. you can find people on huggingface using coding datasets as calibration to prune parts of big MoE models selectively. and the outcome is "it's fine on coding, but if you ask it about who bill clinton is, it has zero preserved knowledge of him whatsoever"

4:19 PM · Jun 20, 2026 · 26.3K Views
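A toy sketch of why the choice of calibration set decides what survives. It assumes the pruning score is roughly the router gate weight times the expert output norm, averaged over the calibration tokens routed to each expert; that is a reading of the REAP name, not the code in the linked repository. The function names and all numbers are hypothetical.

```python
def reap_style_saliency(gate_weights, expert_out_norms, top_k):
    """Score each expert on a calibration set (illustrative, not the repo's API).

    gate_weights[t][e]     : router weight of expert e for token t
    expert_out_norms[t][e] : L2 norm of expert e's output for token t
    Returns the mean of gate * norm over tokens where the expert is in the top-k.
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms in zip(gate_weights, expert_out_norms):
        # Experts the router would actually activate for this token.
        active = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)[:top_k]
        for e in active:
            totals[e] += gates[e] * norms[e]
            counts[e] += 1
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]


def experts_to_keep(saliency, keep_fraction):
    """Keep the highest-scoring experts; everything else is pruned."""
    n_keep = max(1, round(len(saliency) * keep_fraction))
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical coding-only calibration tokens: the router never picks expert 3,
# which in this toy setup stands in for "general knowledge".
code_gates = [
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.4, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.0],
]
code_norms = [[1.0, 1.0, 1.0, 1.0] for _ in code_gates]

saliency = reap_style_saliency(code_gates, code_norms, top_k=2)
kept = experts_to_keep(saliency, keep_fraction=0.5)
print(kept)  # -> [0, 1]
```

Expert 3 never fires on the coding tokens, so it scores zero and is dropped, regardless of how much it would matter for a question the calibration set never asked.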

Sentiment

Many users are excited about selectively pruning MoE models with coding datasets because it enables efficient domain adaptation and specialization while fitting large models on local GPUs.

Pos: 85.7%

Neg: 14.3%

8 comments with sentiment.


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

All according to keikaku

btw this is the end result of a well-implemented fine-grained MoE, and all contemporary MoEs mostly inherit their shapes from this paper



kalomaze@kalomaze

and in some cases it will continue to reason... more or less correctly? just from totally broken premises/a fractured map of world knowledge

orthogonal bill clinton subnetwork lobotomy is possible without mech interp's involvement and without deliberate retraining

kalomaze@kalomaze


kalomaze@kalomaze

the catch is that this would produce horrible historian AIs, but if you can context distill the memorization blocker circuit in prompt contexts where the user specifically asks for the knowledge circuits to be shut off... hmm...


kalomaze@kalomaze

in principle if you can harness this on purpose for world knowledge, you could RL models to be good at first principles prediction of stuff *from premises* rather than from *memorized knowledge*

you'd have a perfectly verifiable massive set of things that really happened


kalomaze@kalomaze

@buildingcoolshi https://github.com/CerebrasResearch/reap


kalomaze@kalomaze

by this i mean, you can preserve hillary clinton by accident, not know of bill clinton, and the model does shit like "hm... perhaps bill is related or married to her...? or the user is confused and was remembering hillary...?"


0xSero@0xSero

This is the simplest way of explaining it.

I have looked at dozens of benchmarks, error patterns, and thousands of reasoning traces.

I’ve used 200M tokens of REAPs this month, and have a series of tweets showing how they build apps, 3d games, animations, and perform computer use, agentic work.

Basically what you get:

- 1.8x token burn on the same prompts vs the original
- increase in self-doubt, strange reasoning logic
- if done badly, corruption in ASCII and formatting of tables (I have solved this by calibrating on this kind of thing and performing mod surgery to restitch lost super experts)
- increased likelihood of looping on topics you didn't preserve for
- more often falls into attractors that cause it to spiral, generating the same token over and over

The argument I like to use:

- companies like zai have good coding training data
- experts have lots of redundancy
- the model often trains on more tokens and is RLed harder than small models (qwen… wtf)

So the 65% of the weights quantized to 4 bits are more valuable than, say, Nemotron Super's bf16 weights.

Often REAPs do 1-5% better on coding/agent benchmarks vs the original at the cost of -50 points on mmlu for example.

I also think this is all a compute problem: if you pour in more calibration, self-distillation, recovery LoRA, and router training, you can get very close to the original.
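For a sense of scale on the "65% of the weights quantized to 4 bits" comparison, here is a back-of-the-envelope footprint calculation. It reads "65%" as the fraction of weights kept after pruning, ignores quantization scales and other overhead, and uses a hypothetical 100B-parameter model, since the post names no size.

```python
def weight_footprint_gb(total_params, keep_fraction, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = total_params * keep_fraction * bits_per_weight / 8
    return total_bytes / 1e9


TOTAL_PARAMS = 100e9  # hypothetical model size, not taken from the post

pruned_4bit = weight_footprint_gb(TOTAL_PARAMS, keep_fraction=0.65, bits_per_weight=4)
full_bf16 = weight_footprint_gb(TOTAL_PARAMS, keep_fraction=1.0, bits_per_weight=16)

print(f"pruned + 4-bit: {pruned_4bit:.1f} GB")  # about 32.5 GB
print(f"unpruned bf16:  {full_bf16:.1f} GB")    # about 200 GB
print(f"ratio:          {pruned_4bit / full_bf16:.4f}")  # 0.65 * 4 / 16 = 0.1625
```

Under these assumptions the pruned, quantized model needs roughly a sixth of the memory of its own bf16 original, which is the budget the post is comparing against a smaller unpruned model.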


Rafa Schwinger 🇻🇦@Rafa_Schwinger

@kalomaze That's probably a lack of regularization. Statistics doesn't like abrupt transitions.


Cody Blakeney@code_star

@kalomaze It will know Al Gore though. He invented the internet.



Elliot Arledge@elliotarledge

@kalomaze local model hyperspecialization


dev@buildingcoolshi

@kalomaze I get like 5 LLM papers when I search for REAP, which one are you referring to?


Carnival Hotdog@CarnivalHotdog

@Rafa_Schwinger @kalomaze but REAP is a router-type application. each of the experts is already internally consistent. the idea is to prune the irrelevant experts (to reduce size) for a task that you're fine-tuning towards.

ultimately, the REAP process is creating a sample imbalance to force a bias.


kalomaze@kalomaze

the most ambitious version of this would be, like, "strip einstein-specific observations encoded as orthogonal knowledge and force it to rederive general relativity", assuming it works, but there's probably *some* degree of dark knowledge (in the Hinton sense) type entanglement


𝑘𝑒𝑟𝑛𝑒𝑙𝑡𝑟𝑖𝑐𝑘@kernel_trick

@kalomaze > where the user specifically asks for

or where *you* don't want the user to access some specific knowledge..


サメQCU@sameQCU

@kalomaze This is extremely hype wtf


Waleed Ahmad@WaleedAhmad1a10

@kalomaze Benjamin Marie ran evals for REAP. They have similar accuracy on many benchmarks but consume more tokens, I think KV cache will eat the returns for some models. https://kaitchup.substack.com/p/qwopus-and-reap-custom-qwen36-models


Rafa Schwinger 🇻🇦@Rafa_Schwinger

that creates an incoherent world-view. It's better if you keep something especially when they are ontologically close. For instance, if you want to create a Python model, you can put mostly a Python dataset, but also a bit of general programming, and then a little bit of STEM, and finally just a little bit of completely general knowledge. that will preserve coherence.
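The layered calibration mix described above could be sketched like this. The post gives only qualitative proportions ("mostly", "a bit", "a little bit"), so the weights, pool names, and sample texts below are all illustrative assumptions.

```python
import random


def build_calibration_mix(pools, weights, n_samples, seed=0):
    """Draw calibration samples from several pools in fixed proportions.

    pools   : dict mapping pool name -> list of text samples
    weights : dict mapping pool name -> relative sampling weight
    Returns a list of (pool_name, sample) pairs.
    """
    rng = random.Random(seed)  # seeded so the mix is reproducible
    names = list(pools)
    picked = rng.choices(names, weights=[weights[n] for n in names], k=n_samples)
    return [(name, rng.choice(pools[name])) for name in picked]


# Hypothetical pools, ordered from the target domain outward.
pools = {
    "python": ["def add(a, b): return a + b", "for x in range(3): print(x)"],
    "general_programming": ["int main(void) { return 0; }"],
    "stem": ["The derivative of x^2 is 2x."],
    "general_knowledge": ["Bill Clinton was the 42nd U.S. president."],
}
# Illustrative weights: mostly Python, tapering off toward general knowledge.
weights = {
    "python": 0.70,
    "general_programming": 0.15,
    "stem": 0.10,
    "general_knowledge": 0.05,
}

mix = build_calibration_mix(pools, weights, n_samples=200)
counts = {name: sum(1 for n, _ in mix if n == name) for name in pools}
print(counts)
```

Whether a small share of general-knowledge samples is enough to keep the relevant experts above the pruning threshold is an empirical question this sketch does not answer.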


Rafa Schwinger 🇻🇦@Rafa_Schwinger

@CarnivalHotdog @kalomaze Each expert is internally consistent locally but not globally. Idk, it feels aggressive to me. I believe from first principles that it would be better to finetune it a bit, even if with a LoRA to be merged in, to smooth out any edges you get in the process


Waleed Ahmad@WaleedAhmad1a10

@kalomaze REAP models will be great at any task if that task/domain was targeted when producing the pruned version, but token efficiency is still a problem to be solved in REAP models.


Carnival Hotdog@CarnivalHotdog

@Rafa_Schwinger @kalomaze what does coding have to do with clinton? i'm sure there's some obama in the coding set though. ie: "learn to code."

it's a mixture of experts, and you're zeroing out the non-discrete logical experts, like a person who's good at math but bad with names and dates.
